LIBRARY  OF  THE 

UNIVERSITY  OF  ILLINOIS 

AT  URBANA-CHAMPAIGN 


bi.0.%4 


IS*'  /1   Report  No.  k 2h 


' r 


y 


PARALLELISM  EXPOSURE  AND  EXPLOITATION  IN  PROGRAMS 

by 
Yoichi  Muraoka 


February,  1971 


LIBF 


NOV    9  1972 

UNIVERSITY  OF  ILLINOIS 

AT  urbawa-champaign: 


DEPARTMENT  OF  COMPUTER  SCIENCE 
UNIVERSITY  OF  ILLINOIS  AT  URBANA-CHAMPAIGN 


URBANA,  ILLINOIS 


Digitized  by  the  Internet  Archive 
in  2013 


http://archive.org/details/parallelismexpos424mura 


PARALLELISM  EXPOSURE  AND  EXPLOITATION  IN  PROGRAMS 


BY 


YOICHI  MURAOKA 
B.Eng.,  Waseda  University,  19^5 
M.S.,  University  of  Illinois,  1969 


THESIS 


Submitted  in  partial  fulfillment  of  the  requirements 

for  the  degree  of  Doctor  of  Philosophy  in  Computer  Science 

in  the  Graduate  College  of  the 

University  of  Illinois  at  Urbana-Champaign,  1971 


Urbana,  Illinois 


iii 


ACKNOWLEDGEMENT 

The  author  would  like  to  express  his  deepest  gratitude  to 
Professor  David  J.  Kuck,  the  Department  of  the  Computer  Science  of  the 
University  of  Illinois,  whose  encouragement  and  good  advice  have  led 
this  work  to  the  successful  completion.  Also  Paul  Kraska  read  the 
thesis  and  provided  valuable  comments. 

Special  thanks  should  go  to  Mrs.  Linda  Bridges  without  whose 
excellent  job  of  typing,  the  final  form  would  have  never  come  out. 
Thanks  are  also  extended  to  Mrs.  Diana  Mercer  who  helped  in  getting 
the  thesis  finished  on  time. 


IV 


TABLE  OF  CONTENTS 


Page 


1.  INTRODUCTION 1 

2.  PARALLEL  COMPUTATION  OF  SUMMATIONS,  POWERS  AND  POLYNOMIALS 11 

2 . 1  Introduction 11 

2 . 2  Summation  of  n  Numbers 14 

2 . 3  Computation  of  Powers 23 

2.4  Computation  of  Polynomials 31 

2.4.1  Computation  of  Polynomial  on  an  Arbitrary  Size 

Machine 31 

2.4.1.1  k-th  Order  Horner's  Rule 32 

2.4.1.2  Estrin's  Method 32 

2.4.1.3  Tree  Method 33 

2.4.1.4  Folding  Method 35 

2.4.1.5  Comparison  of  Four  Methods 38 

2.4.2  Polynomial  Computation  by  the  k-th  Order  Horner's  Rule...  39 

3.  TREE  HEIGHT  REDUCTION  ALGORITHM 52 

3«1  Introduction 52 

3-2  Tree  Height  and  Distribution 53 

3.3  Holes  and  Spaces 63 

3«3«1  Introduction 63 

3.3.2  Holes 70 

3.3.3  Space 76 

3.4  Algorithm 85 

3.4.1  Distribution  Algorithm 86 

3*4.2  Implementation 91 


V 

Page 

3«  5  Discussion 9J+ 

3.5-1  The  Height  of  a  Tree 9U 

3-5-2  Introduction  of  Other  Operators 98 

3-5-2. 1  Subtraction  and  Division 98 

3«5-2.2  Relational  Operators 99 

k .      COMPLETE  PROGRAM  HANDLING 100 

k.l     Back  Substitution  -  A  Block  of  Assignment  Statements  and  an 

Iteration 100 

k .  2  J'oops „ 110 

4.3  Jumps 113 

h.k     Error  Analysis llU 

5.  PARALLELISM  BETWEEN  STATEMENTS „ .  122 

5. 1  Program „ 122 

5.2  Equivalent  Relations  Between  Executions 125 

6.  PARALLELISM  IN  PROGRAM  LOOPS 135 

6.1  Introduction 135 

6.1.1  Replacement  of  a  for  Statement  with  Many  Statements 135 

6.1.2  A  Restricted  Loop li+1 

■  .  2  A  Loop  With  a  Single  Body  Statement 1^3 

6.2.1  Introduction li+3 

6.2.2  Type  1  Parallelism 146 

6.2.2.1  General  Case 146 

6.2.2.2  A  Restricted  Loop 153 

6.2.2.3  Temporary  Locations 156 

6.2.3  Type  2  Parallelism l60 

6.2.U   :onclusion 167 


Page 

6.3  A  Loop  With  Many  Body  Statements 167 

6.3*1  Introduction 167 

6.3«2  Parallel  Computation  with  Respect  to  a  Loop  Index 171 

6.3*3   leparation  of  a  Loop 173 

o .  3  •  3  •  1  Introduction 173 

6.3«3«2  The  Ordering  Relation  (e   )  and  Separation  of 

Loop 174 

6.3«3«3   temporary  Storage 179 

6.3*^  Parallelism  Between  Body  Statements 182 

oTjnrrT  Introduction 182 

6. 3« h.2     The  Statement  Dependence  Graph  and  the  Algorithm. .  184 

6.3»5  Discussion 190 

7 .  EQUALLY  WEIGHTED--TWO  PROCESSOR  SCHEDULING  PROBLEM 192 

7.1  Introduction 192 

7-2  Job  Graph 196 

7 . 3  Scheduling  of  a  Tight  Graph 199 

7  •  -V  Scheduling  of  a  Loose  Graph 2lU 

7.  5  Supplement 225 

8.  CONCLUSION 230 

LIST  OF  REFERENCES 233 

VITA 236 


VI 1 


LIST  OF  TABLES 


Table  Page 

2.1.  The  Parallel  Computation  Time  for  Summation,  Power  and 

Polynomial 12 

n 

2.2.  The  Number  of  Steps  Required  to  Compute  E   a.  on  P(m),  h  (m,n), 

i=l  x 
for  n  <  10 18 

2.3-   Computation  of  p  (x)  by  Folding  Method 38 

2.k.      The  Number  of  Steps  Required  to  Compute  p  (x),  h  (m, n),  for 

n  <  10 1+8 

k.l.     Comparison  of  Back  Substituted,  y  ,  and  Non-Back  Substituted 

Computation,  y .  --Iteration  Formulas 10l+ 

U.2.   Comparison  of  Back  Substituted,  y  ,  and  Non-Back  Substituted 

Computation,  y. --General  Cases 108 


Vlll 


LIST  OF  FIGURES 

Figure  Page 

1.1.  Statement  Dependence  Relation 3 

1.2.  Trees  for  ( (a+b)+(c+d) )  and  ( ( (a+b)+c)+d) 5 

1 . 3  •   Tree  s  for  a  +  b  x  c  +  d  and  bx  c  +  a  +  d 5 

lA.   Trees  for  a(bcd+e)  and  abed  +  ae 6 

2.1.  The  Minimum  Number,  M,  of  PE's  Required  to  Add  Numbers  in  the 
Minimum  Time 22 

2.2.  Computation  of  x   (l) 26 

2.3.  Computation  of  x  (2) 27 

.  -  .   Computation  of  x11  (3) 29 

2.5.   Computation  of  x"  (k) 30 

.6.  Computation  of  a.x  3k 

2.7.  A  Tree  for  p   .  .  (x) 35 

2.8.  A  Tree  for  p  +.(x) 37 

2.9*   Comparison  of  the  Four  Parallel  Polynomial  Computation  Schemes.  kO 

2 .10 .  k-th  Order  Horner '  s  Rule 1+1 

.11.  The  Number  of  Steps,  h  (m, n),  to  Compute  p  (x)  on  P(m)  by  the 

m-  th  Order kk 

2.12.  The  Minimum  number,  M,  of  PE's  Required  to  Compute  P  (x)  in  the 

Minimum  Time 51 

3.1.  An  Arithmetic  Expression  Tree  (l) 52 

.2.  An  Arithmetic  Expression  Tree  (2) 56 

3.3.   Free  Nodes 63 


Free  Nodes  in  a  Tree 6k 


3-5-  An  example  of  F.  and  F„ 66 

3.6.   Elimination  of  a  Free  Node 66 


IX 


Figure  Page 

5.7.  A  Minimum  Height  Tree. .  .  . 68 

3 .8 .  Attachment  of  T[  t '  ]  to  a  Free  Node 71+ 

3.9.  An  Example  of  Space  (l) 77 

3 .10 .  An  Example  of  Space  (2) 78 

3.11.  Distribution  of  t'  over  A 8l 

3  .12.  Tree  Height  Reduction  by  Hole  Creation 82 

3  .13»  Stacks  for  an  Arithmetic  Expression 91 

k.l.      A  Back  Substituted  Tree 102 

k.2.      Loop  Analysis 112 

.  .  A  Tree  with  a  Boolean  Expression 11^ 

•'i  .h  .  Trees  for  a(bc+d)  +  e  and  abc  +  ad  +  e 118 

5  .1 .  Conditions  for  the  Output  Equivalence 127 

6  .1 .   EQ 11*8 

.  .   E[  Ij Ik8 

6.y.      Conditions  of  Parallel  Computation  in  a  Loop 150 

■' .  •  .   An  Illustration  of  t  158 

6.5-   Wave  Front l6l 

.  .   Wave  Front  Travel l62 

6.7.  An  Illustration  for  Theorem  k 16U 

6.8.  An  Execution  by  a  Wave  Front 166 

.  .     ultaneous  Execution  of  Body  Statements 170 

6.10.  Execution  of  PB 173 

u  '^ 

6 .11 .  An  Introduction  of  Temporary  Locations 180 

6.12.  Wave  Front  for  Simultaneous  Execution  of  Body  Statements 183 

6 .13 .  A  Wave  Front  for  Example  10 187 


Figure  Page 

7.1.  Computation  of  Nondistributed  and  Distributed  Arithmetic 

Expressions  on  P(2) 19^ 


Common  Expression 195 

A  Loose  Graph  and  a  Tight  Graph 197 

A  Graph  G 201 

An  Illustration  for  Lemma  3 20U 

An  Example  of  a  Tight  Graph  Scheduling 209 

An  Illustration  for  Lemma  11 212 

A  Loose  Node 21^ 


7-2 
7.3 
1-h 

7.5 
7-6 

7-7 
7.8 

7-9 

7.10.  An  Example  of  the  Maximum  p-connectable  Distance 220 

7.11.  An  Illustration  for  Lemma  13 22U 

t  P 
7 .  12 .  An  Example  for  A  (-) 226 

7-13-  An  Example  for  p- connectivity  Discovery 228 


The  p-line  Relation  in  B  217 

n 


1.   INTRODUCTION 

1.1  Introduction 

The  purpose  of  this  research  is  to  study  compiling  techniques  for 
parallel  processing  machines. 

Due  to  remarkable  innovations  of  technology  today  such  as  the  intro- 
duction of  LSI,  it  has  become  feasible  to  introduce  more  hardware  into  computer 
systems  to  attain  otherwise  impossible  high  speeds.   For  example  Winograd  [k-2] 
showed  that  the  minimum  amount  of  time  required  to  add  two  t  bit  numbers  is 
[logpt]d  ([x]  denotes  the  smallest  integer  not  smaller  than  x),  where  we  assume 

that  an  adder  consists  of  two  input  binary  logic  elements,  e.g.  AND  or  OR  gates 
and  d  is  a  delay  time  per  gate.  An  adder  which  realizes  this  speed  requires  a 
huge  number  of  gates,  e.g.  approximately  1300  gates  for  t  =  6  [12],  and  it  has 
been  out  of  the  question  to  build  such  an  adder.  However,  the  introduction  of 
LSI  has  reduced  the  cost  of  a  gate  significantly,  e.g.  it  has  been  anticipated 
that  by  197^  the  cost  of  LSI  would  be  reduced  to  0.7  cent  per  gate  [33]«  Another 
example  is  a  class  of  parallel  processing  machines.   The  Illiac  IV  [7],  the  CDC 
6600  [k]   and  the  D825  [^1]  are  included  in  this  class.  A  machine  in  this  class 
has  e.g.  many  arithmetic  units  to  allow  simultaneous  execution  of  arithmetic 
operations.   As  an  extreme  case  it  has  been  suggested  to  include  special  arith- 
metic units,  e.g.  a  log  taking  unit  (in  x)  and  an  exponent  unit  (x  )  \  l6] .   (Such 
being  the  case,  this  decade  may  be  marked  as  a  "computer  architecture"  race, 
reminiscent  of  the  cycle-time  and  multiprogramming  races  of  the  60's  [12].)  We 
shall  not  go  into  the  details  of  machines  further.   An  extensive  survey  of 
parallel  processing  machines  is  found  in  [30]. 


Having  a  parallel  processing  machine  which  is  capable  of  processing  many- 
operations  simultaneously,  we  are  faced  with  the  problem  of  exploiting  parallelism 
from  a  program  so  Uiat  computational  resources  be  kept  as  busy  as  possible  to 
process  the  program  in  the  shortest  time.  We  now  discuss  the  problem  in  detail. 

In  this  thesis,  by  a  parallel  (processing)  machine  P  we  understand  a  set 
of  arbitrarily  many  identical  elements  called  processing  elements  (PE).  A  PE  is 
assumed  to  be  capable  of  performing  any  binary  arithmetic  operations,  e.g. 
addition  and  multiplication,  in  the  same  amount  of  time.   Furthermore  we  assume 
that  data  can  be  transferred  between  any  PE's  instantaneously.  Also  we  write  P(m) 
if  P  has  only  m  PE's.  A  machine  of  this  nature  may  be  considered  as  a  general- 
ization of  the  Illiac  IV. 

To  date  two  types  of  parallelism  exploitation  techniques  are  known  to 
compile  a  program  written  in  a  conventional  programming  language  (e.g.  ALGOL)  for 
the  parallel  processing  machine  [36].   They  may  be  termed  as  intra-statement 
parallelism  and  inter-statement  parallelism  exploitation  techniques.   The  first 
technique  is  to  analyze  the  parallelism  which  exists  within  a  statement,  e.g.  an 
arithmetic  expression  and  this  has  been  explored  by  Stone  [k-0],    Squire  r 39] ^ 
Hellerman  [20],  and  Baer  and  Bovet  [6].   For  example  consider  the  arithmetic 
expression: 

a+b   +c+d  +  e+f+g+h 
and  a  syntactic   tree   for  it: 


3 
2 
1 


level 


The  tree  is  such  that  operations  on  the  same  level  may  be  done  in  parallel.  The 
height  of  a  tree  is  the  maximum  level  of  the  tree  and  indicates  the  number  of 
steps  required  to  evaluate  an  arithmetic  expression  in  parallel.  Note  that  there 
may  be  many  different  syntactic  trees  for  an  arithmetic  expression,  and  among 
them  the  tree  with  the  minimum  height  should  be  chosen  to  attain  the  minimum 
parallel  computation  time.  Baer  and  Bovet's  algorithm  is  claimed  to  achieve  this 
end.  i.e.  build  the  minimum  height  syntactic  tree  for  an  arithmetic  expression  [6]. 

Exploitation  of  inter-statement  parallelism  has  also  been  studied  [10], 
[37]-  An  outcome  of  these  works  is  an  algorithm  (the  dependence  relation  detect- 
ion algorithm  [ 10] )  which  detects  the  dependence  relation  between  statements  in 
a  loop-  and  jump-free  sequence  of  statements.   The  dependence  relation  between 
S  and  S'  holds  if  S  proceeds  S'  in  a  sequence  and  S'  uses  the  output  of  S  as  an 
input  to  S ' .   For  example  the  algorithm  dects  that  the  statement  SI  in  Figure 
1.1  must  be  computed  before  the  statement  S2,  but  it  may  be  computed  simultaneous- 
ly with  S3- 

SI:  x  :=  f-^y); 

S2:  u  :=  f2(x); 


S3:  v  :=  f3(w); 
Sk:     z  :=  f^(v,u); 

(a)  program  (b)   dependence  relation 

Figure  1.1.   Statement  Dependence  Relation 

Since  in  a  real  program  the  major  part  of  the  execution  time  is  spent  within  loops 
if  it  is  executed  sequentially,  the  major  effort  should  be  directed  toward 
detecting  inter-statement  parallelism  in  loops.   For  example  we  would  like  to  find 


out  that  all  fifty  statements,  A[  1]  :=  f(B[l]),  ...,  A[  50]  :=  f(B[50]),  in  a  loop 

E:   for  I  :=  1  step  1  until  50  do 
A[I]  :=  f(B[I]) 
may  be  executed  simultaneously  to  reduce  the  computation  time  to  one  fiftieth  of 
the  original.  A  technique  available  now  which  detects  inter- statement  parallelism 
inside  a  loop  requires  a  loop  to  be  first  replaced  with  (expanded  to)  a  sequence 
of  statements,  e.g.  E  in  the  above  example  must  be  replaced  with  the  sequence  of 
fifty  statements,  A[ 1]  :=  f (B[ 1] ),  ...,  A[ 50]  :=  f (B[ 50] ),  so  that  the  dependence 
relation  detection  algorithm  can  be  applied  [ 10] .   Obviously  this  approach 
obscures  an  advantage  of  the  introduction  of  loops  into  a  program  because 
essentially  all  loops  are  required  to  be  removed  from  a  program  and  replaced  with 
straight-line  programs  so  that  the  dependence  relation  detection  algorithm  can  be 
applied  on  them. 

The  techniques  described  above  find  out  parallelism  inside  and  between 
statements  as  they  are  presented.   If  the  size  of  a  machine  (i.e.  the  number  of 
PE's)  is  unlimited,  however,  then  it  becomes  necessary  to  exploit  more  parallelism 
from  a  program  than  the  above  approaches  provide.   One  obvious  strategy  is  to 
write  a  completely  new  program  using  e.g.  parallel  numerical  methods  [52],  [38] • 
The  other  approach  which  we  will  pursue  here  is  to  transform  a  given  program  to 
"squeeze"  more  parallelism  from  it.  While  the  first  approach  requires  programmers 
(or  users)  to  reanalyze  problems  and  reprogram,  the  second  approach  tries  to 
accept  existing  sequential  programs  written  in  e.g.  AIG0L  and  execute  them  in 
parallel.   First  we  study  parallel  computation  of  an  arithmetic  expression  more 
carefully  along  this  line. 

For  the  sake  of  argument  let  us  assume  that  an  arithmetic  expression 
consists  of  additions,  multiplications  and  possibly  parentheses.   Then  the 
associative,  the  commutative  and  the  distributive  laws  hold.   The  first  and  second 


laws  have  been  already  used  to  exploit  more  parallelism  from  an  arithmetic 
expression.  For  example  the  associative  law  allows  one  to  compute  the  arithmetic 
expression  a+b+c+das  ((a+b)  +  (c+d))  in  two  steps  rather  than  as 
( ( (a+b)+c)+d)  which  requires  three  steps. 


(a+b)   +  (c  +  d) 


(((a  +  b)   +    c)   +   d) 


Figure  1.2.  Trees  for  ( (a+b)+(c+d))  and  ( ((a+b)+c)+d) 


Also  it  has  been  recognized  that  the  commutative  law  together  with  the  associative 
law  gives  a  lower  height  tree.  For  example  ((a  +  b  x  c)  +  d)  requires  three  steps 
while  (b  x  c  +  (a  +  d))  requires  two  [39]* 


b   x 


Figure  1.3-   Trees  for  a  +  b  x  c+d  and  b  x  c  +  a  +  d 

Now  we  turn  our  interest  to  the  third  law,  i.e.  the  distributive  law  and  see  if 
it  can  help  speeding  up  computation.  As  we  can  readily  see  there  are  cases  when 
distribution  helps.   For  example  a(bcd  +  e)  requires  four  steps  while  its 
equivalent  abed  +  ae  which  is  obtained  by  distributing  a  over  bed  +  e  can  be 
computed  in  three  steps. 


a     (b     c     d  +  e)  abcd  +  a 


Figure  l.k.      Trees  for  a(bcd+e)  and  abed  +  ae 

However,  distribution  does  not  necessarily  always  speed  up  computation.   For 
example  the  undistributed  form  ab(c+d)  can  be  computed  in  fewer  steps  than  the 
distributed  form  abc  +  abd.  Hence  nondiscr inn  native  distribution  is  not  the 
solution  to  the  problem.   Chapter  3  of  this  thesis  studies  this  situation  and 
gives  an  algorithm  which  we  call  the  distribution  algorithm.  Given  an  arithmetic 
expression  A  the  distribution  algorithm  derives  the  arithmetic  expression  A  by 
distributing  multiplications  over  additions  properly  so  that  the  height  of  A 

(we  write  h[A  ]  for  this)  is  minimized.   The  algorithm  works  from  the  innermost 
parenthesis  level  to  the  outermost  parenthesis  level  of  an  arithmetic  expression 
and  requires  only  one  scan  through  the  entire  arithmetic  expression.  Chapter  3 
concludes  by  giving  a  measure  of  the  height  of  the  minimum  height  tree  for  A  as 

well  as  A  as  a  function  of  fundamental  values  such  as  the  number  of  single 
variable  occurrences  in  A. 

The  idea  is  extended  to  handle  a  sequence  of  assignment  statements  in 
Chapter  k.      The  distribution  algorithm  is  applied  on  the  arithmetic  expression 
which  is  obtained  by  backsubstituting  a  statement  into  one  another. 
Suppose  we  have  a  sequence  of  n  assignment  statements  A  ,  A  ,  ...,  A  and  we  get 

the  assignment  statement  A  from  this  sequence  by  back  substitution.   If  the 


sequence  is  computed  sequentially,    i.e.    one  statement  after  another,   but  each 
statement  is   computed  in  parallel,   then  it  will  take  h[A_]    +  h[A  ]    +   ...    +  h[A  ] 

steps  to  compute  the  sequence  (where  h[A.]  is  the  height  of  the  minimum  height 

tree  for  A.).   Instead  we  may  compute  the  back  substituted  statement  A  in 

parallel  which  requires  h[A]  steps.   Obviously  h[A1]  +  ...  +  h[A  ]  >  h[A]  holds. 

Chapter  h   discusses  cases  when  the  strict  inequality  in  the  above  equation  holds. 

The  cases  include  iteration  formulas  such  as  x.  n  :=  a  x  x.  +  b. 

l+l        1 

Next  we  study  inter-statement  parallelism  in  terms  of  program  loops. 
Chapter  6  first  establishes  a  new  algorithm  which  detects  inter-statement 
parallelism  in  a  loop.   The  algorithm  is  such  that  it  only  examines  index 
expressions  and  the  way  index  values  vary  in  a  loop  to  detect  parallel  computa- 
bility.   For  example  the  algorithm  checks  index  expressions  I  and  I  +  1  as  well  as 
the  clause  "I  :=  1  step  1  until  20"  in  the  loop 

for  I  :=  1  step  1  until  20  do 
A[I]  :=  A[I+1]  +  B 
and  detects  that  all  twenty  statements,  A[  1]  :=  A[2]  +  B,  ...,  A[ 20]  :=  A[  21]  +  B, 
may  be  computed  simultaneously.   Thus  it  is  not  necessary  to  expand  a  loop  into  a 
sequence  of  statements  as  was  required  before  to  check  inter-statement  parallelism. 
In  general,  the  amount  of  work  (i.e.  the  time)  required  by  the  algorithm  is 
proportional  to  the  number  of  index  expression  occurrences  in  statements  in  a  loop. 

Having  established  the  algorithm,  Chapter  6  further  introduces  two 
techniques  which  help  to  exploit  more  inter-statement  parallelism  in  loops.   These 
are  the  introduction  of  temporary  locations  and  the  distribution  of  a  loop.   The 
second  technique  resembles  the  idea  introduced  in  Chapter  3,  i.e.  reduction  of 
tree  height  for  an  arithmetic  expression  by  distribution.  Let  us  write 


I,  J,  K(S1,  S2,  S3) 
for  an  ALGOL — like  program 

for  I  :-  i1,  ±2,    . ..,  im  do 

for  J  :=  Op  J2,  .--j  dn  do 

for  K  :=  kn,  k„,  ....  k  do 

1   2'      p  — 

begin  SI;  S2;  S3  end. 

t  ^  ^ 

Furthermore  by  e.g.  [I, J],  K(S1, S2,S3)  we  understand  a  loop" 

for  (I, J)  :=  (i-^^),  (i^Jg),  ••.,  (ip^),  (ig,^),  ...,  Un>dn)  do 

for  K  :=  k.. ,  k^,  ....  k  do 

1'   2'       p  — 

begin  SI;  S2;  S3  end. 
Then  as  in  the  case  of  arithmetic  expressions  we  may  establish  the  following: 

(a)  Association:   Introduction  of  brackets,  e.g.  I,  [ J,K] (SI,  S2,  S3) • 

(b)  Commutation:   Change  of  the  order  of  I,J,K  e.g.  I, K,  J(S1,  S2,S3). 

(c)  Distribution:  Distribution  of  I,J,K  over  SI,  S2,S3,  e.g. 

I,J,K(S1),I,J,K(S2,S3). 
Then  while  the  associative  law  always  holds,  e.g.  I,J,K(S)  =  [I,J],K(S),  the 
commutative  and  the  distributive  laws  do  not  necessarily  hold  for  all  loops,  e.g 
I, J(S)  /  J,I(S)  if  I,J(S)  represents  a  loop 
for  I  :=  1,  2,  3  do 
for  J  :=  1,  2,  3  do 

A[I,J]    :=  A[I+1,J-1]. 


JL 

"This   is   equivalent  to  a  TRANQUIL  expression  [2] 

for    (I,  J)    seq^   ((ip^);  (i^^l^-v  (im*dn))    do- 


In  short,  Chapter  6  shows  that  commutation  indicates  the  possibility  of  computing 
a  loop  in  parallel  as  it  is  and  distribution  indicates  the  possibility  of  intro- 
ducing more  parallelism  into  a  program.   For  example  if  I,J,K(S)  =  K,  I,J(S),  then 
S  can  be  computed  simultaneously  for  all  values  of  K  while  I  and  J  vary 
sequentially.   Next  suppose  a  loop  l(Sl, S2)  cannot  be  computed  in  parallel  for  all 
values  of  I.   Then  in  a  certain  case  it  is  possible  to  distribute  and  obtain  two 
loops  I (SI),  l(S2)  which  are  equivalent  to  the  original  loop,  I (SI, S2),  and 
execute  each  of  two  loops  in  parallel  for  all  values  of  I  separately.   Chapter  6 
gives  an  algorithm  to  distribute  to  attain  this  end. 

The  thesis,  thus,  introduces  new  techniques  which  transform  a  given 
program  to  expose  hidden  parallelism.  All  results  in  this  thesis  are  also  readily 
applicable  to  another  type  of  machines,  i.e.  machines  with  a  pipe-line  arithmetic 
unit  such  as  CDC  STAR  [  18]  (we  regard  this  type  of  machines  as  a  special  type  of 
parallel  machines  and  call  them  serial  array  machines).  Each  stage  of  a  pipe-line 
unit  may  be  regarded  as  an  independent  PE  in  the  sense  that  an  operation  being 
processed  in  one  stage  of  a  pipe-line  unit  must  not  depend  on  an  operation  being 
processed  in  a  different  stage.  Hence  exploiting  parallelism  results  in  busying 
many  stages  at  once. 

Two  more  chapters  are  included  in  this  thesis  to  make  it  complete. 
Chapter  2  studies  parallel  computation  of  special  cases  of  arithmetic  expressions, 
e.g.  powers  and  polynomials,  in  detail  to  give  a  measure  of  the  power  of  a 
parallel  processing  machine. 

As  was  mentioned  before,  unless  specially  mentioned,  it  will  be  assumed 
that  there  are  a  sufficient  number  of  PE's  available  to  perform  the  desired  task. 
In  reality,  however,  that  may  not  be  the  case  and  non  trivial  scheduling  problems 
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may  arise.   To  give  some  insight  to  this  problem  Chapter  7  discusses  a  solution 
to  the  two  processor-equally  weighted  job  scheduling  problem. 

We  conclude  this  chapter  by  defining  the  following  symbols: 
[x]    ...  the  smallest  integer  not  smaller  than  x, 
[  xj    ...  the  largest  integer  not  larger  than  x,  and 
\  x~|    ...  the  smallest  power  of  2  not  smaller  than  x. 

Also  unless  specified,  the  base  of  logarithms  is  assumed  to  be  2,  e.g.  log  n  is 
log2n. 
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2.   PARALLEL  COMPUTATION  OF  SUMMATIONS,  POWERS  AND  POLYNOMIALS 

2.1  Introduction 

In  this  chapter,  we  study  the  parallel  computation  of  summations, 

powers  and  polynomials.   We  first  assume  that  m  processors  (PE)  are  available. 

n 
Then  the  parallel  computation  times  for  the  summation  (  Z  a.  )  and  the  power  (x  ) 

i=l  1 

evaluation  are  given  as  functions  of  m  and  n.   The  minimum  time  to  evaluate 

n 

Z   a.  or  x  as  well  as  the  minimum  number  of  PE's  required  to  attain  it  is  also 
i-  1 

derived. 

Polynomial  computation  is  first  studied  assuming  the  availability  of 

an  arbitrary  number  of  PE's.   The  lower  bound  on  the  computation  time  "or  a 

polynomial  of  degree  n  (p  (x))  is  presented.   A  scheme  which  computes  p  (x)  in 

lesser  time  than  any  known  scheme  is  obtained.   Because  of  its  simplicity  in 
scheduling,  the  k-th  order  Horner's  "^ule  is  .studied  further  in  detail.   It  is 
shown  that  for  this  algorithm  the  availability  of  more  PE's  sometimes  increases 
the  computation  time. 

Table  2.1  summarizes  a  part  of  results  of  this  chapter. 

Before  we  go  further  a  few  comments  are  in  order.   The  base  of 
logarithms  in  this  chapter  is  2,  e.g.  log  n  is  actually  log  n.   The  following 

lemma  will  be  frequently  referred  to  in  the  text. 
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Lemma  1: 


Proof: 


(1)  flog  al    -    flog  bl    <   (log  a  -   log  bl 

where  a  and  b  are  non  zero  positive  integers. 

(2)  Ta  +  bl    =    Tal    +  b, 
la  +  bj    =    j_aj    +  b, 

f"b   -   al    =     b   -    Tal    +  1  and 
l_b-aj=     b   -    [_aj    -   1 

where  b  is  a  positive  integer  and  a  is  a  positive  real  number 
(not  an  integer); 

(3)  a  +  l>["al  >  a  and 
b  -  1  <  (bj  <  b 

where  a  and  b  are  non  zero  positive  real  numbers. 


(1)  Let  a  =  2h  +  k  and  b  =  2  +  g  where  0  <  k  <  2  -  1  and  0  <  g  < 

2  -  1.   Now  the  proof  is  divided  into  four  cases,  i.e.  (i)  k, 

g  >  0,  (ii)  k  =  g  =  0,  (iii)  k  -  0,  g  >  0,  and  (iv)  k  >  0,  g  =  0, 
(i):   k,  g  >  0. 

Then  flog  al  -  flog  bl  =  h  -  f.  Also  let  log  a  = 
h  +  x  and  log  b  =  f  +  y  where  0  <  x,  y  <  1.   Then 
flog  a  -  log  bl  >  h  -  f .   Thus  (log  al  -  flog  bl  < 
■log  a  -  log  bl  . 

Other  three  cases  may  be  proved  similarly  and  the  details  are 

omitted. 
(2) (3)  Proofs  for  (2)  and  (3)  follow  from  the  definition. 

(Q.E.D.) 


ih 


2.2  Summation  of  n  Numbers 


Theorem  1: 


The  minimum  number  of  steps,    h   (m,  n),    to  add  n  numbers  on  P(m)   is 
'(1):  n-1  (m=l) 

ha(m,n)   =  <((2):     Ln/mJ    -  1  +  '"log   (m  +  n  -Ln/mjmjl 

( l_n/2j    >  m  >  2 ) 
\p):      log  n"l  (m  >   |_n/2j ) 

Proof: 

(l)  is  self  evident.  (3)  uses  the  so-called  log  sum  method  [22]  or 
the  tree  method  (see  Theorem  1  of  Chapter  3)-  It  is  clear  that  "log  n"l  steps 
are  required  and  also  that  n  numbers  cannot  be  added  in  fewer  steps. 

Now  we  prove  (2).   First  each  PE  adds  [_n/m|  numbers  independently. 
This  takes  |_n/mj  "  1  steps  and  produces  m  partial  sums.   Then  there  will  be 
m  +  (n  -  [n/mj  •  m)  numbers  le.'t.   Clearly  m  +  n  -  [n/mj  •  m  <  2m.   Then  those 
numbers  are  added  by  the  log  sum  method,  which  takes  llog(m  +  n  -  (_n/mj  m) 
steps. 

(Q.E.D.) 
Now  we  show  that  for  a  fixed  n,  |_n/2j  >  m  >  m'  implies  that  h  (m, n)  < 

h  (m',n).   To  prove  this  it  is  enough  to  show  that  h  (m  +  1,  n)  <  h  (m, n)  where 
m  h-  1  £  |_n/2j  .   There  are  two  cases: 

(1)   (n/mj  =  |_n/(m  +  l)j  =  k  >  1. 

Let  n=km+p  (p  <  m).      Then 

m  +  n   -    LnAlJ    m  =  m  +  p 
and 
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(m  +  1)    +  n  -  l_n/(m  +  l)J    (m  +  1)   =  m  +  p  +  1   -  k. 
Hence  we  have 

flog(m  +  n  -  l_nAym)     >    flog((m  +  l)   -  n  -  |_n/(m  +  ill 
(m  +  1)?  ,    or  ha(m,n)  >  ha(m  +  l,n). 
(2)      jn/mj   >  Ln/(m  +  l)j  . 

Let  [n/mj  =  k  and  LnAm  +  l)j    =  k  -  g,    where  k,    g  >  1.      Then 
n  =  km  +  p  (p  <  m)  (l) 

and  n  -    (k  -  g)(m  +  1)   +  p'  (p'    <  m  +  l).  (2) 

Suppose  h   (m, n)   <  h   (m  +  1, n),    i.e. 

|||    -  1  +    flogfm  «-(2j.]     <  L;jij|    -  1  +   [WO*  +  1) 

+  n.    |^|(m  +  l)jl.  (3) 

By  substituting  Eq.  (l)  and  (2)  into  Eq.  (3)  and  by  rearranging 
we  get 

g  <  riog(ra  +  1  +  p')'  -  r"iog(m  +  p)~l  .  (k) 

If  we  can  prove 

g  <    ftog((m  +  1  +  p')/(m  +  p))1  ,  (5) 

then  by  Lemma  1(1),  we  can  prove  Eq.  (k) .      Eq.  (5)  holds  if 

2g  <  (m  +  1  +  p1 )/(m  +  p)  (6) 

holds.   Since 

2  +  l/m  >  (m  +   1  +  p* )/(m  +  p),  (7) 

Eq.  (6)  holds  if  and  only  if  g  =  1  (remember  that  g  >  1).   By 
letting  g  =  1  in  Eq.  (2),  we  get 

p'  =  km  +  p  -  (k  -  l)(m  +  1). 
Then  by  substituting  this  and  g  =  1  into  Eq.  (6)  and  by  proper 
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rearrangement,  we  get 

p  <  2  -  k. 
This  only  holds  if  k  =  1  and  p  =  0,  which  implies  that  n  =  m 
(see  Eq.  (l))  and  contradicts  our  assumption  that  m  +  1  <  |n/2j  . 

Hence  h  (m, n)  <  h  (m  +  1, n)  never  holds. 

The  above  two  cases  (l)  and  (2)  prove  that  h  (m  +  l,n)  <  ha(m, n). 
Thus  we  have  the  following  lemma. 

Lemma  2: 

ha(m',n)  >  ha(m,n)  if  m'  <  m  <  |_n/2j  . 

The  above  lemma  may  seem  insignificant.   In  Section  2.4.2,  however, 
it  will  be  shown  that  for  a  certain  algorithm  to  compute  an  n-th  degree 
polynomial  on  P(m),  the  computation  time  step,  h  (m, n),  is  not  a  nonincreasing 
function  of  m,  i.e.,  m  >  m'  does  not  necessarily  imply  that  lr  (m, n)  <  hp(m',n). 
As  it  will  be  described  later,  the  algorithm  is  such  that  all  PE's  are  forced 
to  participate  in  the  computation.   It  is  true  that  if  we  are  allowed  to  "turn 

off"  some  PE's,  then  we  always  get  hP(m, n)  <  hP(m' ,n)  if  m  >  m' .   Then  a 
question  is  how  many  PE's  are  to  be  turned  off.   These  problems  will  be  studied 
in  Section  2.4.2. 

It  should  be  noted  that  the  minimum  number  of  PE's  required  to  achieve 
the  minimum  computation  time  is  not  necessarily  |_n/2J  .   For  example  let  n  =  17- 

Then  (1T/2J  =  8  and  ha.n(l7)  =  ha(8,17)  =  5-   But  also  ha(6,17)  =  5- 

As  we  know,  the  minimum  computation  time  to  add  n  number  is  'log  n'. 
Now  we  present  the  minimum  number  of  PE's,  M,  which  achieves  this  bound. 
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Theorem  2: 

For  a  fixed  n,    let 

u 

M  -   u   (Ln/mJ    -  1  +   n.og(m  +  n  -    Ln/mj  m)'    =    log  rfl  )" 


m 


Then 

M  =  f(l)        1  +   L(n  -  l)/2j    -  2k~2  (2k  <  n  <  2k  +  2k_1) 

(2)        n   -   2k  (2k  +  2k_1  <  n  <  2k+1) 

where  k   -   l_log(n   -  l)j  . 

Proof: 

For  k  <  3,    the  direct  examination  shows  that  the  theorem  holds   (see 

Table  2.2).      Therefore  we  assume  that  k  >  3« 

k  k         k-1 

The  proof  is   divided  into  two  parts,    i.e.    (l)2     <n<2     +2      "  and 

(2)   2+2         <n<2        .      In  either  case  we  first  prove   that  h   (M,  n)   =    'log  nl 

7hen   for  m  <  M     we   show  that  h   (m,  n)  >   flog  n~| . 

It   should  be  clear  that   in  both  cases   2     <  n  <  2         and  Tlog  nl  = 


k  +  1, 


(1)        2k  <  n  £  2k  +  2k_1 

Let  n  =  2k  +  p.   (l  <  p  <  2k_1)  (8) 

>w  we  first  show  that  h  (m, n)  =  'log  nl  where 

M  =  1  +  |jn  -  1)/2J  -  2k*2  (9) 


4 

u( condition)  denotes  the  minimum  value  of  m  which  satisfies  the 
m 

condition. 
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m 


1 

2 

3 

k 

5 

k 

M 

Case 

2 

1 

0 

1 

2 

3 

2 

2 

1 

1 

1 

k 

3 

2 

1 

2 

2 

5 

k 

3 

3 

3 

2 

2 

1 

6 

5 

3 

3 

3 

2 

2 

1 

7 

6 

1+ 

3 

3 

2 

3 

2 

8 

7 

k 

k 

3 

2 

4 

2 

9 

8 

5 

k 

h 

If 

3 

3 

1 

10 

9 

5 

h 

k 

k 

3 

3 

1 

(1):  2k<n<2k  +  2*"1 

(2):  2k  +  2k"1  <  n  <  2k+1 
where  k  =  [log(n-l)j. 
n:   The  degree  of  a  polynomial 

m:   The  number  of  PE's 

M:   The  minimum  number  of  PE's 


Table  2.2.   The  Number  of  Steps  Eequired  to  Compute  2  a.  on  P(m),  h  (m, n), 


i=l 


for  n  <  10. 
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Now 

ha(m,n)  -  Ln/MJ  -  1  +  riog(M-n-Ln/M|  M)l  •         (10 ) 

We  then  show  that  [n/Mj  =  3  for  all  n.   Then  we  get  (M  -  n  - 

|n/M]M)  =  n  -  2M.   The  value  of  n  -  2M  is .  evaluated  in  three 

ways,  i.e.  (i)  p  =  2g,  (ii)  p  =  2g  +  1  (g  >  l)  or  (iii)  p  «  1. 

k-1 
In  any  case  n  -  2M  £  2   "  holds.   Thus  we  can  prove  that 

log   (M  -  n   -   Ln/MJM)1    <  k  -   1. 

and 

ha(M,n)  <  k  +  1  =    Tlog  ril  . 

Then  we  prove  that  for  all  m  <  M,  h  (m, n)  >  ("log  nl  .   This  is 

proved  by  showing  that  h  (M  -  1,  n)  >  I  log  nl  .   Then  by  Lemma 

2,  h  (m,  n)  >  ("log  n~l  for  all  m  <  M. 

Now  let  us  show  the  details.   First  we  prove  that  h  (M,  n) 
log  ill.   From  Eq.  (8)  and  (9),  we  get 

M  =  2k"2  +  1  +  jjp  -  l)/2j,  (11) 

and  by  Lemma  1(2),  we  get 

[jj|  "  5  -  LpJ  (12) 

where 

p  =     ^L(p  -  D/2J   +  k  -  y 

2k"2   +  1   +    |ip   -   1)/2J 
By  Lemma  1(3),    we  get 


P  <  P'    = 


2p  -  h 


0k-l 

2  +  p   -   1 


k-1 
Now  we  show  that  for  all  p(l<p<2   ),  P'  <  1.   Since 
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^  ■  (2k~1  +  p  -  I)2  >  ° 
we  have 

max  P*  =  %-^ —  <  1   for  1  <  p  <  2k_1  (k  >  k). 
2-1  "   " 

Thus  P  <  1  and  by  Eq.  (12)  we  have  (n/Mj  =  3«   Substituting  this 

into  Eq.  (10),  we  get  ha(M,n)  =  2  +  llog(n  -  2M)1  .   Now 
subtracting  two  times  Eq.  (ll)  from  Eq.  (8)  we  have 

n  -  2M  =  2k_1  +  p  -  2  -  2j_(p  -  l)/2j  .  (13) 

Eq.  (13)  is  evaluated  in  three  different  ways  according  to  the 
value  of  p,  i.e.  (i)  p  =  2g,  (ii)  p  =  2g  +  1  (g  >  l)  or 
(iii)  p  =  1  (in  every  case  g  is  an  integer). 

(i)  p  =  2g  (From  Eq.  (l),  g  >  l). 

n  -  2M  =  2k_1. 
(ii)  p  =  2g  +  1,  (g   1). 

n   -  2M  =  2k~1  +2g+l-2-2g<  2        . 
(iii)      p  =  1. 

k-1  k-1 

n  -  2M  =  2    +  1  -  2  <  2   . 

k-1     p-        ~i 

Hence  in  any  case  n  -  2M  <  2   "or  llog(n-2M)'  <  k  -  1.   Thus 

ha(M,n)  <  k  +  1  =  Tiog  n"l  . 
This  proves  the  first  part  of  (1).   Next  we  prove  the  latter 

part,  i.e.  for  m  <  M,  h  (m,n)  >  h  (M,n). 
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First  we  show  that  ha(M  -  1,  n)  >  ha(M,  n).   From  Eq.  (8) 
and  (9),  and  using  Lemma  1(2)  we  get 

\vrh\  =  i>  -  QJ  =  3  -  lQj>  a1*) 

where 

Q=  H(p  -  D/2J  -  P  . 

2k_2  +  L(p  -  1)/2J 
As  we  showed  for  P,  we  can  also  prove  that  for  all  p  (l  <  p  < 

2k_1),  Q  <  1.   From  Eq.  (1*0,  we  have  |_n/(M  -  l)j  =3-   Then 

ha(M  -  1,  n)  =  2  +  Tiogdi  -  2(M  -  1)11  .   From  Eq.  (8)  and  (11), 
we  get 

n  -  2(M  -  1)  =  2k-1  +  p  -  2|_(p  -  l)/2j  >  2k-1  +  1 

or    Tlog(n  -  2(M  -   l))"l     >  k.      Hence 

Tlog  nl    =   k  +  l<k+2<  ha(M  -   1,    n). 

Thus  for  all  m  <   M,  h  (m, n)  >  Tog  nl  by  Lemma  2,  and  this  proves 
(1). 

(2)   2k  +  2k_1  <  n  <  2k+1. 

Let  n  -  2k  +  2k_1  +  p  (l  <  p  <  2k_1).  (15) 

We  first  sketch  the  proof.   We  first  show  that  h  (M,  n)  = 

flog  nl  .   To  show  this  we  prove  that  l_n/Mj  =  2  for  all  n  (2  + 

k-1       k+1 

<  n  <  2   ).   Then  using  this  we  get  M  +  n  -  (_n/Mj  M  = 

n  -  M.   We  further  show  that  n  -  M  =  k.   Thus  we  get  ha(M, n)  = 

k  +  1  =  flog  nl.   Then  we  prove  that  ha(M  -  1,  n)  >  ha(M,  n) 
which  together  with  Lemma  2  completes  the  proof  of  (2). 
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The  details  are  similar  to  (l)  and  will  not  "be  given  here. 

(Q.E.D.) 

Theorems  1,  2  and  Lemma  2  also  apply  to  the  case  of  multiplication  of 

n  numbers,  and  to  avoid  duplication,  the  corresponding  lemmas  for  multiplication 
shall  not  be  presented. 

Next  let  us  consider  the  power  computation,  e.g.  x  (n  >  2). 

2.3  Computation  of  Powers 

Lemma  3  '■      [23] 

Let  N  be  the  number  of  ones  in  the  binary  representation  for  n. 

Then  the  near  minimum  number  of  steps  to  compute  x  on  P(l),  h  (l, n), 
is 

he(l,n)  -  LloS  nJ  +  N  -  1. 

Up  to  now,  there  is  no  result  about  the  minimum  computation  time  to 

evaluate  x  [23].   Thus  we  shall  settle  for  an  approximation.   For  example  let 

e  15 

15.   Then  h  (1,15)  =  l_lo§  15J  +4-1  =  6.   On  the  other  hand  x   can  be 

evaluated  in  fewer  steps,  e.g. 

(x)2  =  x2  -  'x2)(x)  =  x3  -  (x3)(x2)  =  x5 

f   5n2    10   ,  5w  10 x    15 
-*  (x^)  =  x   -  (x  )(x  )  =  x  •  . 

This  takes  only  5  steps.   For  n  <  70,  this  lemma  gives  the  correct  values  for 

more  than  70$  of  the  cases. 

While  we  cannot  give  the  definite  answer  for  the  sequential  case,  we 

can  prove  the  following. 


2k 


Theorem  J>: 

The  minimum  number  of  steps  to  compute  x  on  P(m)  (m  >  1),  he(m,n), 

is 

h"(m,  n)  =  ("log  nl  . 

Proofs  of  Lemma  3  and  Theorem  3: 

Let  a  =  log  n  and  3  =  log  (n  +  l)  for  convenience.   Let  I  be  the 

j-th  most  significant  bit  in  the  binary  representation  for  n.   If  m  =  1,  then  x 
is  computed  as  follows.   First  let  us  write 

X.  -  (x2V3+1. 

J 

We  first  compute  all  x  (i  =  2  ,  2  ,  . ..,  2  ^ -• )  in  iqi  steps.   Then 


Xn=  (XQ)x  (Xx)  x  ...  x  (x2Lq,)Irpl 


and  this  computation  takes  N  -  1  steps  (note  that  if  I .  =  0,  then  X.  ,  =  l). 
Thus  in  total  i.-^j  +  N  -  1  steps  are  needed. 

If  two  PE's  are  used,  then  x  is  computed  as  follows.  Again  let  us 

write 

X  =  ( XQ )  x  ( X1 )  x  • . •  X  ( x    ) 

Now  this  can  be  computed  by  the  following  two  recursive  equations. 
t(l)  -  x 

k      k-1  k-1 
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k      k-1   k-1' 


and 


xn  -  t(2) 
Two  PE's  are  required  for  the  simultaneous  computation  of  t/,   and  t}.   . 

K.  K 


That  the  above  process  for  P(2)  is  optimum  is  clear,  because  x 
cannot  be  computed  in  less  than  [log  nj  steps  and  at  least  l_log  n)  +  1 

(=  llog  nl )  steps  are  required  to  compute  x  . 

(Q.E.D.  ) 

From  the  above  discussion,  we  have  the  following  corollary. 


2Llog  "J 


lorollary : 


he(m,n)  =  h  (2,n)  for  all  m  >  2. 

wow  let  us  study  simultaneous  computation  of  all  x  (i=l,2, . . .,n) 


Theorem  h 


The  minimum  number  of  steps,  h  (m,n),  required  for  simultaneous 


evaluation  of  all  x  (i=l, 2, . . .,n)  on  P(m)  is 

r 

(1)       n  -  1  (m  =  1) 


h    (m,n)   =  -< 


V 


(2)  Llog  mj    +  1  +    r(n  _   2Llog  mJ+1)/m1 

(max(n.2riogn"1-1,    2  ^  n1  "2)  >  m  >  2) 

(3)  flog  nl       (m  >  max  (n  -  2  riog  n_1  "\ 


flog  nl  „  HLog  ril  -2, , 
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Proof: 

(l)  is  obvious.   (3)  is  illustrated  in  Figure  2.2.  At  the  k-th  step, 

i       k-1       k-1  k\ 

the  x  (i  =  2    +  1,  2    +  2,  ...,  2  )  are  computed  using  the  results  of 

k     k-1     k-1 
earlier  steps,  e.g.  x   =  x    x  x    .   The  number  of  PE's  required  at  this 

step  is  then  2k  -  (2k-1  +  l)  +  1  =  2  "  . 


\    PE 

step\ 

pi 

P2 

P3 

\ 

" 

No.    of  PE's 
required 

1 

X2 

1 

2 

x3 

X 

2 

3 

5 

X' 

6 

X 

x7 

8 

X 

1+ 

• 

. 

k 

a+1 

X 

a+2 

X 

•  • 

2a 

X 

2k-i 

r log  nl-l 

b+1 

X 

.  . 

2b 

X 

b 

r log  nl 

c+1 

X 

" 

n 

X 

n  -  2c 

a-  2^       b=  2ri°Sn1-2 


c  =  2b 


Figure  2.2.   Computation  of  x  (l) 


The  maximum  number  of  PE's  required  is  the  larger  one  of  n  -  2         (the 

number  of  PE's  required  at  the  last  step)  or  2  °S     "  (the  number  of  PE's 
required  at  the  (  Hog  h~l  -  l)-th  step).   This  proves  (3).   Clearly  this 
procedure  is  optimum  in  the  sense  that  it  gives  the  minimum  computation  time. 
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max(n  -  2 


Next  suppose  that  the  number  of  PE's  available,   m,    is  less  than 
flog  ffl  -1     /log  nl  -2}>      ^  first  all  xi(l  <  ±  <  2  l_l°g  *J  +  1)   are 


:omputed  in    [log  mj    +1  steps  in  the  same  manner  as  the  above  procedure. 


Number  of  PE's 
* m     


1 

2 

X 

x\\\\\\\\\\\ 

2 

x3 

i2 

WWWWV 

• 

I  log  mj    +  1 

a 

X 

xb 

\V 

. 

c 

X 

• 

n 

\\\\X 

~T" 

J  log  rnj+1 

r         _ 1 


m 


i 


a  =   2llogmj    +1 
b  =  2[ log  mj    +  1 


c  =  b   +1=   2ll°gmJ    +1  +1 


Figure  2.3-   Computation  of  x  (2) 
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Now  there  are  n  -  2  ^°g  m-J+1  x1  left  (2^°g  ^ +1  <  i  <  n)  to  be 

computed.   This  takes  "(n  -   2  •-  g  -•   )/m'  steps  on  P(m).   Clearly  at  each  step, 

all  necessary  data  to  perform  operations  are  available.   To  show  this,  let  us 

a    b 
take  two  successive  steps.  Assume  that  the  first  step  computes  x  ~  x  where 

b  -  a  f  1  =  m.   Then  the  second  step  computes  x      ~  x     .   Since  b  +  m  = 
2b  +  1  -  a  <  2b  (a  >  1),  all  inputs  required  at  the  second  step  are  available 

from  the  first  step.   Thus  in  total  (log  mj  +  1  +  "(n  -   2 L  °g  ^   )/m~'  steps 
are  required.   This  proves  (2). 

(Q.E.D.) 

Clearly  for  fixed  n,  m  >  m'  implies  that  h  (m,n)  <  h  (m',n).   Thus 
we  have: 

Lemma  h : 

w  /     \ 

For  fixed  n,  h  (m,  n;  is  a  non-increasing  function  of  m. 

We  again  call  the  reader's  attention  that  the  number  of  PE's  required 

to  compute  all  x  in  the  minimum  number  of  steps,  i.e.  'log  n~l ,  is  not 

•  -,     t     o  flog  nl  -1  _  flog  n~l  -2N   _       ,   ,       lQ   _ 

necessarily  max(n-2        ,2        ).   For  example,  let  n  =  io.   Then 

max(l8  -  2  ,2  )  =  8,  and  'log  18  =  5«   Yet  P(5)  achieves  the  same  result,  i.e. 
hwK,l8)  =  5- 

Lemma_5: 

The  minimum  number  of  PE's,  M,  necessary  to  compute  all  x  (l  <  i  <  n) 
simultaneously  in  the  shortest  time  is 
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M  = 


(1)  n  -   2ri0g  3'1 

(2)  r(n-   2&**m2)/2 


(n   -  2ri0g  ^  _1  >  2ri°S  ^  "2) 
(otherwise) 


Proof: 


Let   q  =    ilog  n~l  . 

(1)     n  -  2Crl  •>  2°~2. 

Suppose  that  there  are  only  m  PE's  where  m  <  n  ■■  2 


Q-l 


step\ 

pi 

P2 

PJ 

•• 

P 

m 

riog  nl    -  1 

r log  nl 

a 

X 

a+ni 

X 

n 

X 

s- 


m 


left  out 


a=   2fl0g  nl    -   X  +  1 


Figure  2.k.      Computation  of  x  (3) 


Then  x(2°""+l+m<i<n)  cannot  be  computed  at  the  a-  th 
step.  Also  none  of  them  can  be  computed  at  an  earlier  step 
because  their  inputs  are  not  ready.   This  proves  (l). 

(2)   n  -  2~rl  <  2r2. 
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flog  nl    -  1 


f log  nl 


,f  log  nl   -  2  +  1 


2riognl    -2  +m 


[log  nl   -  l 


Figure  2.5*      Computation  of  x     (k) 


Then  we   delay  the   computation  of  all  x1(2Qr<:"   +  m  +  l<i<2a-  l) 
which  are   originally  scheduled  to  be  computed  at  the    (a  -  l)-th 
step  till  the  next   step,    i.e.    the  <.rth  step   (i.e.    the  last   step). 

At  the  o-th  step  we  compute  all  remaining  x  {2°rt'  +m  +  1  <  i  <  n). 
The  value  of  m  has  been  chosen  in  such  a  way  that 


,a-2 


cr2 


(2U  -   1)    -   (2urc  +m  +  l)+l>n-    (2urc   +  m  +  l)   +  1 


i.e. 


n 


,cr2 


m  =    "(n  -    (2°^   +  1)   -l)/2». 
Also  since 

2(2°~2   +  m)   >  2(2^2   +  n/2   -   2a"3) 


=   n  +  2a~2    >  n, 
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all  inputs  required  at  the  cr th  step  are  ready  at  the  (a  -   l)-th 
step  or  earlier  steps. 

(Q.E.D.) 

2.k     Computation  of  Polynomials 

In  this  section,  we  study  polynomial  computation.   First  we  assume 
that  there  are  arbitrarily  many  PE's.   Then  four  schemes  are  studied  and 
compared.   Two  of  them  are  known  as  Estrin's  method  [16]  and  the  k-th  order 
Horner's  rule  [15].   Two  new  methods  are  also  introduced.   They  are  called  a 
tree  method  (see  Chapter  3)  and  a  folding  method.   It  is  shown  that  if  there 
are  arbitrarily  many  PE's,  then  the  folding  method  gives  a  faster  computation 
time  than  any  known  method. 

Then  we  study  the  case  where  only  a  limited  number  of  PE's  are 
available.   Because  of  the  simplicity  of  scheduling,  the  k-th  order  Horner's 
rule  is  studied  in  detail.   It  is  shown  that  on  P(m)  the  m-th  order 
Horner's  rule  does  not  necessarily  guarantee  the  fastest  computation,  i.e., 
there  is  a  case  where  the  m'-th  order  Horner's  rule  (m'  <  m)  gives  a  better 
result.   Thus  availability  of  more  computational  resources  does  not  necessarily 
"speed  up"  the  computation  for  a  certain  class  of  feasible  parallel  computation 
algorithms. 

. I . 1   "omputation  of  a  Polynomial  on  an  Arbitrary  Size  Machine 


Definition 


We  write  p  (x)  for  a  polynomial  of  degree  n 


i    \      n        n-1 
p  ( x )  =  a  x  +  a  ,x    +  . . .  +  art . 

rn'      n      n-1  0 
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2. if. 1.1  k-th  Order  Horner's  Rule  [15] 

The  details  will  "be  presented  in  Section  2.1J-.2.   Theorem  5  shows  that 
the  minimum  time  required  to  compute  p  (x)  by  this  method  is 


n 


hP.      =    flog  n"l    +    (log    (n+l)~l    +  1 

mm 


2.4.1.2     Estrin's   Method  fl5]["l6l 
We   first  compute 

C°  =  a     +  xa  i  =  0,    2,    ...,    2  Ln/2j 


i+1 
Then   successively  compute 


c}  =   C°  +  x2C°±+2  i  =  0,    k,    ...,    k^/kj 

C?    =  c]  +  xV  ,  i  =  0,   8,    ...,   8Ln/8j 


ii  i 
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m       „m-l  2m^m-l  .        n      _m+l  m+1      /om+l 

C.     :   C.         +  x     C.+2m  i   =   0,    2        ,     ...,    2        (_n/2      J 

where  m  -    [log  nj    and  more  over 

Pn<x)  .  C 

The  procedure  may  be  illustrated  by 

o 

p  (x)  --  an  +  a  x  +  x~(&     +  ax) 
n       0     1        2     .p 

^  2/       n 

+  x  (a,  +  ax  +  x  (a^-+a  xjj 

ftp  1+  2 

+  x   (ag  +  ax  +  x   (a10+a1JLx)    +  x  (a12+a     x+x   (a^+a     x))) 


Now  notice   that   for  each  j   all  C?  may  be   computed  simultaneously. 
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Assuming  the  availability  of  an  arbitrary  number  of  PE's,  it  is  easy 
to  show  that  the  minimum  number  of  steps  to  compute  p  (x)  by  this  method  is 


hE.   =  2 flog  (n  +  l)1  . 


E 

mm 


2.4.1.3  Tree  Method 

In  this  method  a  tree  similar  to  the  one  for  arithmetic  expressions 
is  built  for  a  polynomial  p  (x).   For  example  p,(x)  is  computed  as: 


Computation  by  this  method  consists  of  two  stages,  computation  of 


b.  =  a.x  (0  <  i  <  n)  and  computation  of  Lb.. 
1  i=0 


n 

t 
i 


(l)  Computation  of  b.  =  a.x  (0  <  i  <  n). 

This  requires  flog(i+l)l  steps  by  Theorem  3  (see  Figure  2.6) 


(2)  Computation  of  lb,. 

i=0  x 

As  soon  as  b.'s  become  available  they  are  added  in  the  log  sum 

way.   Suppose  b.  becomes  available  at  the  k-th  step.   Having  a 

k-1 
variable  at  the  k-th  step  is  equivalent  to  having  2    variables 

originally  at  the  first  step  (cf.  the  effective  length  in  Chapter 

3)- 


^ 


Step 

Terms  Which  Are  Computed 

No.    of  Terms  Computed 

1 

ao 

1 

2 

a.,  x 

1 

J 

2           3 

a2X   '    a3X 

2(=2X) 

■ 

h           5           6           7 

cL    X     «       a,_-X     ^      cl/'X     ^      a^X 

M=22) 

riog(n+l)l 

2riog(n+l)l  -2 

riog(n-i  1)1+1 

2  riog(n+i)]-i         n 
a2riog(n+i)i-ix  '                 •••>** 

n_2riog(n+l)l-l+1 

Figure  2.6.   Computation  of  a.x 


Thus,  for  example,  two  variables  at  the  third  step  are  reduced 
to  8  variables  at  the  first  step.  Repeating  this  procedure  we 
get 

o-l 

Z  ( 
i=l 

o-l 

Z  'c 

i=l 

variables  on  the  first  level  where 
a    -  1og(n  +  if"  . 

To  add  these  n'  numbers  by  the  log  sum  method,  it  takes 


Z  (21'12i)    +  2a(n  -  20rl  +  1) 


+      Z22i_1   +  2a(n   -   2Qrl   +  1) 


h  =     H  nrr    n"H 


mm 


log  n'1     steps. 
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2.U.1.U  Folding  Method 

This  is  3.   method  which  computes  Pn(x)  in  shorter  time  than  any  known 


method. 


Assume  that  p  , (x)  can  be  computed  in  h  -  1  steps,  p.(x)  (t  <  i  < 
s  -  1)  are  computed  in  h  steps  and  p  (x)  can  be  computed  in  h  +  1  steps. 


Then  we  show  that  all  p.(x) 


roof 


Steps 

Degree 

h   -  1 

~  t   -  1 

h 

t  ~  s  -  1 

h  +  1 

s  ~    (s   +  t   - 

1) 

h  +  2 

s    +    t   ~ 

(s  <  j  <  s  +  t  -  1)  can  be 
computed  in  h  +  1  steps  and 
further  p   (x)  can  be 

computed  in  h  +  2  steps. 


(1)  First  we  show  that  pg+t _.(x)  (l  <  ,j  <  t)  can  be  computed  in 
h  +  1  steps. 


Figure  2.7-  A  Tree  for  p    .(x) 

*s+t-jv 
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We  write  p    .(x)  as 
s+t-jv 


/v\     /        t-J  v  s         s-1 

.,+.  .(xj  =  (a.    .x  d  +  ...  +  a  )x  +  a  nx    + 
=+t-j         t+S-J  s        s-1 


. .  +  a 


s 


=  pt   (x)x  +Ps.1(x). 


s 


Now  we  show  that  x  can  be  computed  in  less  than  h  steps. 

From  Theorem  1  we  know  that  x  can  be  computed  in  flog  nl 

g 

steps.   Suppose  that  the  computation  of  x  takes  longer  than  h, 

i.e. 

h  <  rio£  si  .  (16) 

Now  from  the  assumption,  p  , (x)  takes  h  steps  to  compute.  Also 

S  *~  J_ 

(see  Section  2.U.1.5) 

h  >  llog(2(s-l)  +  l)1  -  G.og(2s-l)1  .  (17) 

Prom  Eq.  (16)  we  have  s  >  2    +  1  and  from  Eq.  (17 )  we  have 

2h  >  2s  -  1  or  2h_1  >  s.   Thus  we  have  s  >   2h_1  +  1  >  2h_1  >  s 
which  is  a  contradiction.   Thus  h  >  llog  s~l  . 

From  the  assumption  p,  . (x)  can  be  computed  in  h  -  1  steps 


and 


p   ,  (x)  can  be  computed  in  h  steps.   Hence  p   .(x)  •  x  takes 


h  steps  and  p    .(x)  can  be  computed  in  h  +  1  steps. 
^      *s+t-j 

2)  Next  we  show  that  p   (x)  can  be  computed  in  h  +  2  steps. 
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ps-l(x) 


t+l 


Pt(x) 


Figure  2.8.      A  Tree  for  p     . (x) 

S  "T*  U 


We  write 

/     \  /  s-1  \    t+l  t 

Ps+t(x)   =    (as+tX  +   •*•    +  at+l)x  +  atX     +   •'•    +  a0 

-  Ps.1(x)    •    xt+1  +px(x). 

Then  from  the  previous  proof  we  know  that  x   "  can  be 
computed  in  less  than  h  +  1  steps.   Since  p  , (x)  and  p  (x)  take 

h  -  1  and  h  +  1  steps  respectively,  we  can  compute  p   (x)  •  x   "in 


at  most  h  +  1  steps  and  p   (x)  in  at  most  h  +  2  steps. 


(Q.E.D. ) 


It  is  easy  to  check  that  pp(x)  takes  3  steps,  p,(x)  and  p^(x)  can  be 
computed  in  k   steps  and  pq(x)  can  be  computed  in  5  steps.   By  induction  we 
obtain  the  following  table  for  h  =  2,  3,  ...,  10. 
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Minimum  Steps 

Degree  of  Polynomial 

3 

2 

k 

3  -  1* 

5 

5  -  7 

6 

8  -12 

7 

13-20 

8 

21-33 

9 

3^-5^ 

10 

55-87 

11 

89-1^3 

Table  2.3.   Computation  of  p  (x)  by  Folding  Method 


For  example,  p  (x)  takes  9  steps  to  be  computed.   Note  that  the 
first  numbers  in  the  right  column  form  a  Fibonacci  sequence. 


2.4.1.5   Comparison  of  Four  Methods 

It  has  been  proved  that  at  least  2n  operations  are  required  to  compute 
p  Lx  ) .   ■  roof s  appeared  in  several  papers.  We  owe  Ostrowski  [3M  and  Motzkin 

'I  for  their  original  works.  Pan  [35]  summarized  the  results.  An  excellent 
review  of  the  problem  appears  in  [23].  Also  Winograd  [^3]  generalized  results 
of  Ostrowski  and  Motzkin. 

Now  assume  that  to  compute  p  (x)  in  parallel  h  steps  are  required. 

Then  Theorem  5  gives  h  <  log  r?    +  1.og(n+l)l  +  1  <  2fiog(n+l)'  +  1.   Also 

k 
since  2-1  operations  can  be  performed  in  a  parallel  computation  tree  of 
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height  k,    we  have  2h  -   1  >  2n  or  h  >    Hog  ( 2n+l  J1  .      Thus 

2nLog(n+l)1    +  1  >  h  >    flog(2n+l^. 

In  Figure  2-9,  these  upper  and  lower  limits  are  plotted  together 
with  the  results  from  the  previous  section.   It  is  clear  that  the  folding  method 
is  the  best  in  terms  of  the  computational  speed.   It  is  yet  an  open  question 
if  there  is  a  better  method. 

2.U.2   Polynomial  Computation  by  the  k-th  Order  Horner's  Rule 

Now  let  us  study  computation  of  a  polynomial  by  the  k-th  order  Horner's 

rule. 

A  polynomial  is  computed  by  the  k-th  order  Horner's  rule  as 

shown  by  the  following  procedure.  We  use  k  PE's.   First  we  compute  all 

x  (1  <  i  <  k)  simultaneously.   Then  we  compute  k  polynomials  p  ' (x)  on  k  PE's 

simultaneously,  where 

p.  '(x)  =  a.  +  xk(a.^  +  xk(a.  _  +...)))     (0  <  i  <  k  -  l). 
^k   '    l       l+k       i+2k  —   — 

Then  we  get  k  partial  results  which  are  added  to  get  p  (x).   Figure  2.10 

i  1  I  .si  rates  this. 

This  scheduling  may  not  be  the  best,  yet  it  is  easy  to  implement 

i  also  adaptable  to  any  number  of  k. 

Theorem  5 ' 

The  minimum  number  of  steps,    hP(m, n),    required  to  compute  a  degree 
n  polynomial  p   (x)   on  P(m)   by  the  m-th  order  Horner's  rule  is 

(1)     2n  (m  =  1) 

hP(m,n)   =  -{    (2)     2  (log  m~l    +  2[n/mj    +1  (n  +  l>m>2) 

(3)      frog  rH    +    flog(n+l)1    +  1.    (m  >  n  +  l) 
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kl 


Proof: 

A  proof  for  (l)  can  be  found  in  [k-3].      (2)  and  (3)  are  self-evident 
from  the  above  discussion. 

(Q.E.D.) 


k-1 


a0  +  x  (ak  +  x  ...  )) 


(a1  +  x  (aR+1  +  ...  )) 


(a2  +  x  (ak+2  +  ...  )) 


(ak-l  +X  (a2k-l  +  '••  )} 


£ Horner '  s  part 


Lemma  6 


Figure  2.10.   k-th  Order  Horner's  Rule 


The  minimum  number  of  steps  h  .   to  compute  a  polynomial  p  (x)  of 


mm 


degree  n  by    the  k-th  order  Horner's  rule   (l  <  k  <  n  +  1)    is 

hP.      =    log  nl    +    riog(n+iJl    +  i. 
mm  °  t>\         / 


Proof: 


It  is   enough  to  show  that 

2  log  ml    +  2^/mj    +  1  >  2    riog(n+l)1     +  1 


1+2 


or 


flog  m1    +    |_n/mj    >    flogtn+l)1 
for  m  <  n,    because   2flog(n+l)'    >    flog  ir    +    'log(n+l)'    and  if  m  >  n  +  1,    then  by 


Theorem  5  lr  (n+1.  n)   =  h    .    . 

man 


Assume  that  n=2g+t(l<t<2S)   and  m=2     +  s(l  <  s  <  2k).      Then 
e  have    two  cases,    i.e.    (l)   g  =  k  or   (2)   g  >  k.      If  g  =  k,    then  t  >  s   since 


we 


n  >  m. 


(1)  g  =   k  and  t  v>  s. 

Then    log  m1    =k+l=g+l,    and 

Hence 

flog  m)    +   |_n/mj    =  g  +  2  >    flog  ( n+1  )1  . 

(2)  g  >  k. 

We  have  further  two  subcases,  i.e.  (i)  1  <  t  <  2  or  (ii)  t  =  2  . 

(i)  1  <  t  <  2g 

Then    flog  (n +1)1    =  g  +  1,    Q.og  nil    =  k  +  1 
and 


Isl 


2B  +  t 


2k   +   s 


>  2e 


k   -   1 


Hence 

flog  ml    +   (_n/mj    >  k  +  1   +  2g"^k+1') 
Now  we   show  that   for  all  k  <  g 

f(k)   =  k  +  1  +  2g"(k+l)s  g  +  1. 


*o 


Since  for  all  k  <  g, 

|^f(k)  =  1  -  (loge2)  2g"(k+l)  <0 

minf(k)=g+l   where  0  <  k  <  g. 
Hence  f(k)  >  g  +  1  or 

log  ml    +    [n/mj    >  g  +  1  =    riog(n+l)l  . 

(ii)     t  =  2g 

Similarly  to  the  above,   we  can  show  that 

'log  nil    +    in/mj   >    U-Og(n+l)l  . 
The  details  are  omitted. 

(Q.E.D. ) 

Unlike  the  case  of  h  (m,n)  and  h  (m, n),  h  (m, n)  is  not  necessarily  a 
non-increasing  function  of  m.   (A  few  curves  in  Figure  2.11  illustrate  this.) 

Therefore  it  becomes  important  to  choose  an  appropriate  m  for  a  given 
n  to  compute  in  an  optimum  way. 

Theorem  6 : 

Given  an  n-th  degree  polynomial. 
Let 


M  -  M(hP(m,n)   -    flog  n"l    +   fl0g(n+l)1    +  1), 


where 


Then 


hP  ( m,  n )   =  2  flog  ml    +  2  ^n/mj    +  1 . 


kk 


ft 


ft 

-P 

OQ 

O 
U 


H 


n  =  5 


10 


15 


Number  of  PE's    (m) 


Figure  2.11.      The  Number  of  Steps,    h   (m, n),    to  Compute  p   (x)   on  P(m)  by 
the  ra-th  Order 


h5 


(1)  n  +  1 

M  =-<    (2)  ^n+l)/^ 

J3)  WV21 

where  g  =    ^log  ry  • 


(n  =  2g) 


-1, 


(2*  <  n  <  2*   +  2fe      ) 
(2g   +  2S"1  <  n  <  2g+1) 


Proof: 


Proof  is  given  for  each  case  independently. 

(1)  n  =  2g. 

The  proof  is  divided  into  two  parts.   First  it  is  clear  that  if 
we  have  n  +  1  PE's,  then  p  (x)  can  be  computed  in  h  .   steps 

(see  Theorem  5)-   Next  we  show  that  if  the  number  of  PE's,  m,  is 

less  than  or  equal  to  M  -  1  (m  <  n),  then  Ir  .   <  Ir  (m, n),  where 


nun 


hP.      =  flog  nl    +    fiog(n+lJl    +  1  =  2g  +  2. 
man  ov        '  to 


Let  m  =  2k  +  p        (k  <  g,    1  <  p  <  2k).      Then 


hP(m,n)   =  2k  +  5  +  2 


sk+P 


Let 


P  = 


2g"k  p 
2k   +  p 


Then 


(18) 


¥ 


P  = 


(2k+P)2 


>  0, 


and 


kG 


for 


max  P  =  2g_k~1     for  1  <  p  <  2k. 
From  this  and  Eq.  (l8),  we  get 

hP(m,n)  >  2k  +  3  +  2  |2g~k~"!j  .  (19) 

Since  g  >  k,  Eq.  (19)  becomes 

hP(m,n)  >  2k  +  3  +  2g"k. 
Now  let 

f(k)  =  (2k  +  3  +  2g"k)  -  (2g  +  2)  =  2(k  -  g)  +  1  +  2g'k. 

Since  2a  +  1  -  2a  >  0  (note  that  v~(2a  +  1  -  2a)  =  log  2  ■  2a 

da  °e 

2  >  0  for  a  >  l),    we  have   f(k)  >  0  for  all  k  <  g.      Since 

hP(m,n)    -    (2g   +  2)   >  f(k)   >  0, 

hP(m,n)  >  2g  +  2  -  hp.n. 

This  proves    (l). 

Now  since    "log  n^  'log(n  +  l)'     if  n  /  2     for   some  g,    we  use 

hP.      =   2("log(n  +  lp    +  1 
mm  to 


hP.      =    H.og(n  +  l)1     +    log  n~l    +  1 
mm  to 

to  prove  (2)  and  (3)-   Then  it  is  enough  to  show  that 

=   u(  "log(n  +  l)      =    "log  nr    +    l^/mj  ) 
instead  of 

M  =   tJ(2flog(n  +  1^     +1=2  flog  nil    +  2  (n/mj    +  l)« 

Since  2g  <  n  <  2g+     for   (2)   and   (3),    we  have    riog(n  +  l)"1    =  g  +  1.      By  direct 


+7 


computation,  we  can  show  that  the  theorem  holds  for  n  <  10  (see  Table  2.k). 

(2)   2g  <  n  <  2S  +  2g_1. 
Now  we  show  that 

hP.   -  2 log  M*l  +  2  |n/M»  +  1. 
mm  u  /  _j 

To  show  this  we  first  show 

flog  M~l   +   (ji/Mj   =  g  +  1. 

Since  2g  <  n  <  2g  +  2g"  ,  we  have 

2g-2  <  n_+_l_  <  2g-l  (2Q) 

and 

log  Ml    =    flog  ^n  +  1)/3H     =  g  -  1.  (21) 

Now  let      (n  +  l)/y    =  k        (k  >  3  as  we  assumed).      Then 

n  +  1  -   3k  -  p  (p  <  2) 

or 

n  =  3k  -  p  -  1.  (22) 

Using  this  and  the  relation  0  <  (p  +  l)/k  <  1,  we  have 

Ln/Mj  =2.  (23) 

Thus   from  Eq.    (21)   and   (23) 

flog  M1    +    ,n/Mj    =  g  +  1  =    flog(n  +  l)~l, 
or 

hP(M,n)   =  hP.    . 

mm 

Next  we  show  that  if  m'  <  l(n  +  l)/y  ,    then  HLog  m^  +  [n/m'j    > 
g  +  1  =  'log(n  +  1)1  (or  equivalent ly  hP(m',  n)  >  hp.  ). 


mm' 


k8 


m 


1 

2 

3 

k 

5 

6 

7 

8 

9 

10     11 

g 

M 

Case 

2 

it 

1+ 

k 

h 

1 

1 

3 

6 

5 

6 

5 

5 

2 

2 

k 

8 

7 

7 

7 

7 

2 

1 

5 

10 

7 

7 

9 

7 

7 

2 

1 

6 

12 

9 

9 

7 

9 

9 

7 

7 

1+ 

2 

7 

11+ 

9 

9 

7 

9 

9 

9 

7 

7 

h 

2 

8 

16 

11 

9 

9 

9 

9 

9 

9 

9 

9 

3 

1 

9 

18 

11 

11 

9 

9 

9 

9 

9 

11 

9 

9 

U 

1 

10 

20 

13 

11 

9 

11 

9 

9 

9 

11 

11    9 

9 

1+ 

1 

(1):      2S  <  n  <  2g  +  2g~1 

(2):      2g   +  2s"1  <  n  <  26+1 
where  g  =  |  log  nj . 
n:      The  degree  of  a  polynomial 
m:      The  number  of  PE's 
M:      The  minimum  number  of  PE's 


Table  2.k.      The  Number  of  Steps  Required  to  Compute  p   (x),    hp(m,n),    for  n  <  10 
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We  have  two  cases,    i.e. 

(i)     ?}  <  m'    <  21+1  where  i   +  1  <  g  -  2 

d   (ii)     2g~2  <  m'   <    r(n   +  l)/^. 

(i)      21  <  m'    <  2i+1. 

Since  i   +  1  <  g  -  2,    we  write 

g  -  2  =   i   +  1  +  j  (d  >  0). 

Then 

flog  m'"l    =i+l=g-2-j, 

and  since  2S  <  n  <  2g  +  2g_1, 

Ln/mJ    >2g-i-1=   22^. 
Thus 

(log  m*~l    +   |_n/mj    >   (g   -  2   -   ,i)    +  22+J"  >  g   +  1 
=    flog(n  +  1)1 

because  2     ;>  a  +  1  if  a  >  1. 

(ii)     23"2  <  m'    <  T(n  +  l)/fl. 
Let  us  write 

m'    -    Rn   +  1)/51    -    q  =   k   -    q  (q  >   l). 

Then  by  Eq.    (2(>)   and  (22),    we  get 

riog  mn  +  [_£,}  .  s .  i  +[3¥^J 

>  g  +  1  =    flog(n  +  l)', 
because   q  >  1,    p  <  2  and  3q  -  p   -   1  >  0. 

This   ends  a  proof  for   (2). 


50 

(3)   (2g  +  2g   <  n  <  2S+  )  can  be  proved  in  a  similar  manner  and  the 
details  are  omitted. 

(Q.E.D.) 

It  should  be  noted  that  the  function  Ir  (m,n)  is  not  a  non-increasing 
function  of  m  even  if  m  <  M  for  some  n.   However  if  n  <  50,  then  for  more  than 

70$  of  the  cases,  hp(m, n)  turn  out  to  be  non-increasing  functions.   (The  only 

cases  where  hp(m, n)  is  not  a  non-increasing  function  are  the  cases  n  =  15,  27, 

28,  29,  30,  31,  36,  37,  38,  39,  ^5,  ^6,  and  1+7 .   In  any  case,  hP(m,n)  increases 
by   at  most  one. ) 
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3.      TREE  HEIGHT  REDUCTION  ALGORITHM 

3.1  Introduction 

In  this  chapter,  recognition  of  parallelism  within  an  arithmetic 
statement  or  a  block  of  statements  is  discussed.   There  are  several  existing 
algorithms  which  produce  a  syntactic  tree  to  achieve  this  end.   The  tree  is 
such  that  operations  on  the  same  level  can  be  done  in  parallel.  Among  them, 
the  algorithm  by  Baer  and  Bovet  [6]  is  claimed  to  give  the  best  result.   For 
example,  a  statement 

a+b+c+dxe  xf+g+h 
can  be  computed  in  "our  steps  by  their  algorithm  (Figure  3*1) • 


k 

3 
2 

1 

level 


Figure  3-1-   An  Arithmetic  Expression  Tree  (l) 


The  algorithm  reorders  some  terms  in  a  statement  to  decrease 
tree  height.   However,  this  algorithm  does  not  always  take  advantage  of 
distributions  of  multiplication  over  addition.  An  arithmetic  expression 
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a(bcd+e)  takes  four  steps  as  it  is,  whereas  the  equivalent  distributed 
expression  (abcd+ae)  requires  only  three  steps.  A  further  example  is 
Horner's  rule.   To  compute  a  polynomial 

p  (x)  =  a^  +  a,  x  +  a^x  +  ...  +  a  x  ,  (l) 

n      0    12  n 

Horner ' s  rule 

p  (x)  =  a.  +  x(an  +  x(a,  +  . . .  x(a  . +x  a  )))  ...)         (2) 
*n       0      13         n-1    n 

gives  a  good  result  for  serial  machines.  However,  if  a  parallel  machine  is  to  be 

used,  (l)  gives  a  better  result  than  (2).   Namely,  if  we  apply  Baer  and  Bovet's 

algorithm  [6]    on   (2),    we  get  2n  steps  whereas   (l)   requires  only  2flogp(n+l)]    steps 

(see  Chapter  2).   Thus  it  is  desirable  for  the  compiler  to  be  able  to  obtain  (l) 

from  (2)  by  distributing  multiplications  over  additions  properly.  An  algorithm  to 

distribute  multiplications  properly  over  additions  to  obtain  more  parallelism 

(henceforth  called  the  distribution  algorithm  or  the  tree  height  reduction 

algorithm)  is  discussed  now. 

3-2   Tree  Height  and  Distribution 

Definition  1: 

An  arithmetic  expression  A  consists  of  additions,  multiplications,  and 
possibly  parentheses.  We  assume  that  addition  and  multiplication  require  the  same 
amount  of  time  ^see  Chapter  l).   Subtractions  and  divisions  will  be  introduced 
later.   Small  letters  (a,b,c, ...),  possibly  with  subscripts,  denote  single 
variables.   Upper  case  letters  and  t,  possibly  with  subscripts,  denote  arbitrary  • 
arithmetic  expressions,  including  single  variables,   t  is  used  to  single  out 
particular  subexpressions,  i.e.  terms. 


5h 


n 
Then  A  can  always  be  written  as  either  (l)  A  =  E  t.  or 

i=l  x 


(2)  A  =  w   (t.),  e.g.  A  =  abc  +  d  (ef+g)  =  t  +  t  where  t  =  abc  and 
i=l 

t  =  d(ef+g),  or  A  =  (a+b)c(de+f)  =  (t  )t  (t  )  where  t  =  a  +  b,  t  =  c 


and 


t-,  -  de  +  f .   Note  that  when  we  write  A  =  tt  (t.  ),  we  implicitly  assume  that  for 

i=l  x 

n(i) 
each  i  t.  =   Z  t.'.  h[A]  denotes  the  height  of  a  tree  T  for  A,  which  is  of  the 

minimum  height  among  all  possible  trees  for  A  in  its  presented  form. 

A  minimum  height  tree  (henceforth  by  a  tree  we  mean  a  minimum  height 
tree)  for  A,  T[A],  is  built  as  follows  [6]. 

n  n 

Let  us  assume  that  A  =   E  t.  or  A  ^  tt  (t.  )  and  that  for  each  i,  a 

.  ,  i        .  -,   l  ' 

i=l         i=l 

minimum  height  tree  T[t.]  has  been  built.   Then  first  we  choose  two  trees,  say 

T[  t  ]  and  T[  t  ] ,  each  of  whose  height  is  smaller  than  height  of  any  other  tree.  We 

combine  these  two  trees  and  replace  them  by  the  new  tree  whose  height  is  one  higher 

4 

than  max  (h[t  ],  h[  t  ]]  :w 

Combined  tree 


i. 


A^Vli 


hrV+1    X  AhI^ 

t       t       t      r 

I       p      q 

This  procedure  is  repeated  until  all  trees  are  combined  into  one  tree,  which  is 
T[A].   The  procedure  is  formalized  as  follows: 


4 

"Figures  are  in  scale  as  much  as  possible. 
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(1)  Let  ST  =  -j(l),    (2),    (3),    -..,    (n)J-  and  let  h'[i]    =  h[t.]    for 
all   it  ST. 

(2)  Choose  two  elements  of  ST,    say  p  and  q  such  that  h'[p],    h'[q]    <  M" 
where  M  =  rain  jh'[u]l     for  all  u€ST-  -|p,  q|-  . 

(3)  Now  let  ST  =     jsT   -     |p,  qjj     U       j(p, q)|  and  h'[  (p, q)]    = 
max  jh'[p],    h'[q]|     +  1. 

(k)     If  I  ST  |  =  1,  then  stop  else  go  to  Step  2. 

After  we  apply  the  above  procedure  on  A,  we  get  e.g.  ST  =  i  ( (  ((l)(2) )  (3) )  ( (k)  (  5) ) )  )- 

where  a  pair  (ab)  indicates  that  trees  corresponds  to  a  and  b  are  to  be  combined. 
Thus  in  this  case  we  get: 

^((((l)(2))(3))(CO(5))) 

(((D(2))(3))2i^-^^^-->Nf (00(5)) 

((D(2)) 


as  a  minimum  height  tree  of  A.   In  general  the  procedure  is  applied  from  the  lowest 
parenthesis  level  (see  Definition  3)  to  the  higher  parenthesis  levels. 

]xample  1: 

Let  A  =  (a+bc)(d+efg)  +  hi 

=  *1  +  *2 


If  there  are  many  choices,  then  choose  those  subtrees  with  smaller 
-4(Sh[t  ])  values  first  (see  Definition  8). 
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where  t  =  (t  )(t^) 


t_,  -  a  +  be  =  t,  +  t,_ 
3  ^5 


and  t,-  =  d  +  efg  =  t„  +  tQ. 
b  7    8 


T[A] 


Tft3] 


g    h 


Figure  3-2.   An  Arithmetic  Expression  Tree  (2) 


The  <  ffective  length  e  of  an  arithmetic  expression  is  defined  as 

e[A]  .  2h[A1. 

The  number  of  single  variables  in  an  arithmetic  expression  is  the  number 
of  single  variable  occurences  in  it. 

The  height  of  a  minimum  height  tree  can  be  obtained  without  actually 
building  a  tree. 

Theorem  1: 


(1.1)   If  A  s  it   a.  or  A  =   Z  a.  then  h[  A] 
i=l  x       i=l  x 


■log2fplg 
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(1.2)      If  A  =     It.    then  h[A]    -   log 
i=l  2 


P 

£  e[t   ] 
i=l 


P  I 

(1.3)      If  A  =     tt  a.    x     7T  (t.)   then  h[A]    =  log 
i=l  X        3=1     J 


rPi0  +   z  ert.]  I  . 

2    J-l   J  '2 


Given  an  arithmetic  expression  A,  to  obtain  the  height  of  a  tree  for 
A,  Theorem  1  is  applied  from  the  inner  most  parts  of  A  to  the  outer  parts, 
recursively. 

Proof: 


(1)   It  is  obvious  that  if  A  =   Z  a.  or  A  =  tt  a.  then  h[A]  =  log  r  pi  . 

i=l         i=l 


(2)  Now  let  A  =   Z  t. .   Then  we  can  replace  each  t.  by  a  product  of 
i=l  x  x 

e[t.]    single  variables  without  affecting  the  total  tree  height  h[A]. 

(Note  that  each  t.  must  be  computed  before  the  summation  over  t.  is 

e[t.] 
p        L    iJ 

taken.)      Thus  A  becomes  A'    =      Z         tt     a .   and  h[A]    =  MA'].      Let  us 


i=l     j=l 


eft.] 
1 


call  a  tree  for  tt     a.  a  subtree.   Then  a  tree  for  A'  is  built 
,1=1  J 

using  subtrees  in  the  increasing  order  of  their  heights.   Since  a 


binary  tree  of  height  h  cannot  accomodate  more  than  2  leaves,  we 


have  2 


h[A'] 


>  Z  e[ t.l  >  2 
"  i=l   X 


h[A']-l 


or  h[A']  =  h[A]  =  log. 


P 

Z  e[  t  ] 
i=l 


(3)   can  be  proved  in  a  similar  manner. 


(Q.E.D.) 
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Definition  2: 

The  additive  length  a  and  the  multiplicative  length  m  of  an  arithmetic 
expression  A  is  defined  as  follows: 

P 

(2.1)  If  A  =  ir  a..,    then 

i=l  X 

(i)      a[A]    =   e[A]    and 
(ii)     m[  A]   =  p. 

P 

(2.2)  If  A  -      Z  t.,    then 

i=l  X 


P 
(i)      a[A]    =      I  e[t.]    and 
i=l       x 

(ii)      m[A]    =   e[A]. 

P  I 

(2.3)      If  A  =     tt  a.    x     7r  (t.),    then 

i=l  x       d=l     J 

(i)      a[A]    =  e[A]    and 


(ii)     m[A]    =  p  +     I  e[t   ]. 
0=1       J 


It  is  to  be  noted  that 


(!)     £[AI       |a[A]l 

2  '.m[A]j    - 


(2)  h[A]  =  logp  |a[A]  i   =  log   m[A]  I  ; 
d  !      *  2        '      '2 

compare  this  definition  with  Theorem  1. 

Definition  3: 

The  level  £  of  a  parenthesis  pair  in  an  arithmetic  expression  is  defined 
as  follows : 
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First  we  start  numbering  parentheses  at  the  left  of  the  formula,  pro- 
ceeding from  left  to  right,  counting  each  left  parenthesis,  (,  as  +1  and  each 
right  parenthesis,),  as  -1  and  adding  as  we  go.   We  call  the  maximum  number  m 
the  depth  of  parentheses.  Now  the  level  1   of  each  parenthesis  pair  is  obtained 
as  I  =   p,  where  p  is  the  count  for  each  parenthesis.   The  arithmetic  expressions 

enclosed  by  the  level  I   parenthesis  pair  are  called  the  level  I   arithmetic 

I 
expressions,  A  •  Also  for  convenience  we  assume  that  there  is  an  outermost 

parenthesis  pair  which  encloses  A. 


Example  2 


A  = 


123  3  3  3  21 

(ab((cd  +  e)(f  +  g)+k)) 


3 

3 

2 

Now  several  lemmas  are  in  order. 


Lemma  1 


n 


n 


Let  A  =      T  t.    or  A  =     ir  (t.).      Also  let  A1   =  t,    +  t.  + 
.    ,    l  .    ,      l  12 

1=1  1=1 


+  t.  '  + 

l 


...  +  t  or  A'  =  (t. )  x  (t)   x  ...  x  (t. ')  x  ...  x  (t  ),  and  A"  =  A  +  t  .  or  A"  = 
n  12'  l  n"  n+1 


A  x  (tn+1).   Then 


6o 


(i)     h[A']   >h[A]    if  h[t.']   >  h[t±] 
(ii)     h[A"]   >  h[A]. 


Proof: 


Obvious  from  Theorem  1. 

What  Lemma  1  implies  is  that  the  height  of  the  tree  for  an  arithmetic 
expression  is  a  non-decreasing  function  of  term  heights,  and  the  number  of  terms 
involved. 

In  an  arithmetic  expression,  there  are  four  possible  ways  of  parenthesis 
occurence: 

Px)   ...  +  (A)  +  ... 

P_)   ...  6(tn  x  t_  ...  x  t  )  x  (t  '  x  t  '  ...  x  t  ')  9   ... 

2  12        n      1     2        m 

P2)   . ..  0  a     x  a0  x  ...  a  x  (A)  8  . . . 

3  12       n 

pu)    ...  e(t1  +  t2  +  ...  +  tn)  x  (t^  + 12'  +  ...  +  tm»)  e  ... 

where  6   represents  +,  x,  or  no  operation. 

Lemma  2: 

Let   D  -   B  +  ( A)    +  C  and  Dn    =   B+(tn    x    ...   xt    )   x(t'    x    ...    xt    ')    +C. 

1  1  n  1  m 

Also  let  D     =  B  +  A  +  C  and  Dn    =B+tn    x   ...xt     xt'   x   ...   xt   '    +  C.      Then 

11  n         1  m 

r\  d 

h[D]   >h[D]    and  h[  D  ]   >  h[  D  ] . 

Proof: 

Obvious    Prom  Theorem  1. 
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As  an  example,  let  D=(a+b+c)+d  and  D  =  (abc)(defgh) .   Then 

A  A  A 

D     =  a  +  b  +  c  +  d  and  DJ  =  abcdefgh,   h[  D]   =  3  >  h[D  ]   =  2  and  h[D  ]   =  k  > 
h[Dj]    =  3- 


Lemma  > 


Let  D  =|    Z  t  .||    Z  t .  ]  and  Dd  =  t,  t  ,  '      •  t,  tn '     ■  ...   +tt'         ...        t  t 
,1=1  *A.1-1  « 

Then  h[Dd]   >  h[D]. 

Proof: 

n  m 

Let  D  =  (A)(B)  where  A  =  Z  t.  and  B  =   St.'.   Also  without  losing 

i=l  x        1=1  x 

nerality,  assume  that  h[A]  >  h[  B]  .   Then  h[D]  =  h[A]  +  1.   For  each  j,  let  d.  = 

J 

t.'   tn+t.'    t„  +  ...    +  t.'    t.      It  is   clear  that  h[  t . '    t.]   >  h[t.l    for  all  i  and 
j       1         0       2  j        n  j        iJ   -         iJ 

d         m  d 

Thus   from  Lemma  1,    we  have  h[d.]   >  h[A]    for  all  j.      Since  D     =      Z  d.,    h[D  ]    > 

3  0=1  J 

min(h[d  ])    +  log  J  ml     >  h[A}    +  log2rml2,    or  since  h[D]    =  h[A]    +  1,    h[  D  ]   >  h[  D]  . 

(Q.E.D.) 


Note  that  the  above  lemma  does  not  imply  necessarily  that  if  D  = 

ft    \  H"  H" 

Zt.HB),  and  D   =  t  (B)  +  t  (B)  +  ...  +  t  (B),  then  h[  D  ]  >  h[D].   Actually, 
i=l  7  n 

it  can  be  shown  that  there  is  a  case  when  h[  D]  >  hf  D  ] .   What  Lemma  3  says  is 
that  D  =  (A)(B)  should  not  be  fully  distributed,  but  partial  distribution,  as  in 
D  ,  may  be  done  in  some  cases. 
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Lemmas  2  and  3  together  indicate  that  distribution  in  case  (P^)  and 

partial  distribution  for  case  (P,  )  are  the  only  cases  which  should  be  considered 

for  lowering  tree  height.   In  casos  (P  )  and  (Pg),  removal  of  parenthesis 
leads  to  a  better  result  or  at  least  gives  the  same  tree  height.   Full 
distribution  in  case  (P.  )  always  increases  tree  height  and  should  not  be 

done.   Also  it  should  be  clear  that  in  any  case  tree  height  of  an  arithmetic 
expression  can  not  be  lower  than  that  of  a  component  term  even  after 

A  \     ■  d 

distributions  are  done.   For  example,  let  D  =  t(A)  =  t  x  J  F  t  J  and  D  = 

n 

F  ft  v  t.).      Then  from  Lemma  1,    we  have  h[tt.]   >  h[t   ]    for  each  i.      Thus 

i=l 

h[r>d]   >  h[A]. 

The  same  argument  holds  for  all  four  cases.   This  assures  that  evaluation 
of  distributions  can  be  done  locally.   That  is,  if  some  distribution  increases 
tree  height  for  a  term  then  that  distribution  should  not  be  performed  because  once 
tree  height  is  increased,  it  can  never  be  remedied  by  further  distributions. 
Actually,  there  are  two  cases  where  distribution  pays.   For  example,  if  A  = 
a(bcd  +  e),  then  hrA]  =  k.      However,  if  we  distribute  a,  then  we  get  A  =  abed  + 

ae  and  hrA  ]   3-   The  idea  is  to  balance  a  tree  by  filling  the  "holes"  because 
a  balanced  tree  can  accommodate  the  largest  number  of  variables  among  equal  height 
trees.   The  situation  is,  however,  not  totally  trivial,  because  by  distribution, 
the  number  of  variables  in  an  expression  is  also  increased.   Next  let  A  = 

a(bc  +  d)  +e=t+e  and  A  =  abc  +  ad  +  e  =  t  +  e.   In  this  case  h[A]  =  k   but 

h[A  ]  =  h[ t  ]  =3-   What  happened  here  is  that  t  is  "opened"  by  distributing  a 
over  (be  +  d)  and  the  "space"  to  put  e  in  is  created. 
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At  each  level  of  parenthesis  pair,  cases  (P3)  and  (P^),  i.e.,  instances 
of  "holes"  and  "spaces",  are  checked  and  proper  distribution  is  performed.   Next 
we  give  definitions  of  holes  and  spaces,  and  formalize  these  ideas. 

3-3  Holes  and  Spaces 

3-3«l  Introduction 

Before  we  proceed  further,  let  us  study  trees  for  arithmetic  expressions 

more  carefully. 

P 
Let  A  =   Z  t..   By  Definition  1  we  first  build  minimum  height  trees  T[t.] 
i=l  1  x 

for  all  i,    and  T[AJ    is  built  by  combining  these  T[t.].      Once   T[t.]    is  built  the 

details  of  t.  do  not  matter,  and  the  only  thing  that  matters  is  its  height  h[t.]. 

Suppose  T[t.]    and  T[t.]    are  combined  to  build  T[A].     Assume  also  that 

hrt.l    =  hrt.l    +  s.      Then  we  will  get   s  nodes   to  which  no  trees  are  attached  other 
l  3 

than  T[t.].      We  call  these  free  nodes  whose  heights   are  h[t.]    +1,    h[t.]    +2,    ..., 
J  J  J 

hrt.]. 
i 


T[t.: 

<?            T[t.]-1 

1 

<?         T[t    )+l 

X  -^ 

Ah[td] 

Free  Nodes 
(S) 


Figure  3-3-   Free  Nodes 


6k 


Similarly  we  can  enumerate  all  free  nodes  in  T[A]  with  their  heights. 


O   free  node 
£^  occupied  node 


root  of  T[  t .  ] 


root  of  T[A] 


Figure  ^>.h.      Free  Nodes  in  a  Tree 


Free  nodes  in  a  tree  T 


P 

it   (t.) 


L  i=l 


are  defined  similarly. 


Let  us  emphasize  that  once  we  get  Tft.]  we  treat  it  as  a  whole  and  do  not 

care  about  its  details  when  we  build  T[A].   That  is,  when  we  consider  free  nodes 
in  T[A]  we  mean  free  nodes  "in"  T[A]  but  "outside"  of  Tft.].   For  example  let  A  = 


(a+b)(cde+f)  =  (\){t    ).      Then 


T[A] 


d     e 
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a  and  3  are  free  nodes  in  T[A]  while  y   and  8  are  free  nodes  in  T[t  ]  and  not  in 
T[A]. 

Now  suppose  there  are  m  free  nodes  in  T[A].  We  number  them  arbitrarily 
from  1  to  m.   Also  let  us  denote  the  height  of  a  free  node  a  '»y  h[al« 
Given  a  free  node  q-  whose  height  is  hi"  a]  in  T  [A  =  I  t.] 

(or  T  [A  -  tt(  t .  )  ] ),  by  definition  we  can  attach  a  tree  T[t]  whose  height  is  h[a]-l 

(or  whose  effective  length  is  2  ^aJ"  )  to  a  without  affecting  the  height  of  A: 


Definition  k: 


For  A  =  7r(t.)  or  A  =  Zt . ,  we  build  a  tree.   Then 
l  i 

(k.l)      define  F  [A]  to  be  a  set  of  all  free  nodes  in  T[A],  and 

(U.2)   for  each  i  define  F  [A,  t.]  to  be  a  set  of  all  free  nodes  which 

r\      1 

exist  between  the  roots  of  T[A]  and  T[ t . ] ,  i.e.  the  free  nodes 
which  we  encounter  when  we  traverse  T[A]  down  to  the  root  of 
T[t.]. 

For  example  let  us  consider  the  following  tree  (see  Figure  J. 5)- 
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Figure  J5-5-  An  Example  of  F  and  F 


Then  FA[A]  =  fa,  3,  7,  6,  e]  and  FR[A,  t±]    =    fa,  3,  y),    FR[A,  tg]  =  { c&  3}  etc. 

Lemma  h: 

Suppose  h[  a]  =  h[  3]  for  some  free  nodes  a  and  3  in  T[A].  Then  without 
changing  the  tree  height  h[A],  we  can  replace  two  free  nodes  a  and  3  hy  one  free 
node  3'  whose  height  is  h[a]+l. 


Proof: 


original  T[A] 


Figure  3-6.   Elimination  of  a  Free  Node 
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modified  T[A]        N 


Figure  3- 6-   (continued) 


We  can  combine  subtrees  1  and  3,  and  hence  eliminate  free  nodes  a 
and  3  and  create  a  new  free  node  3'  (see  Figure  3-6). 

(Q.E.D.) 

Given  F.[A],  two  free  nodes  a  and  (3  of  equal  height  can  be  replaced  by 
one  free  node  31  whose  height  is  h[  a]  +  1«   Repeating  this  procedure  finally  we 
get  a  new  F  '[A]  in  which  no  two  free  nodes  have  the  same  height.   Let  y   and  8 
be  free  nodes  in  F.[A]  and  F  '[A]  respectively.  Assume  that  for  all  free  nodes  a 
in  F  [A]  (or  F  '[A]),  h[  7]  >  h[  a]    (or  h[61  >  h[  a]  ) .   Then  obviously  h[  7]  <h[5]. 
However,  it  is  clear  that  h[6]  <  h[A],  i.e.  if  h[6]  =  h[A]  then  h[A]  is  not 
the  height  of  a  minimum  tree  (see  Figure  3*7 )•   Hence  we  have  the  following 
corollary. 


'orollary  1: 


In  a  minimum  height  tree  T[A], 


I       2h[a]"1  <kA]. 
C*FA[A] 
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Proof: 


Assume  that     S  2h[a]~1  =  |e[ 

aeF  [A] 


re[A].   Then  by  Lemma  h,    alia  in  F  [A] 


can  be  replaced  by  one  free  node  3  whose  height  is  h[A]  (note  2  *■  ■•"  =  e[A]/2) 
This,  however,  implies  that  we  can  build  a  tree  T'[A]  whose  height  is  h[A]-l, 
which  contradicts  our  assumption  that  T[A]  is  a  minimum  height  tree  for  A. 


T[A] 


Figure  3«T«  A  Minimum  Height  Tree 


(Q.E.D.) 


The  following  definitions  are  also  used  in  the  next  section. 


Definition  5: 

An  integer  set  is  a  set  of  integers  with  possibly  duplicated  elements, 
e.g.  (2,2,2,^,8,8).   If  an  integer  x  occurs  in  an  integer  set  Y  at  least  once  then 
we  write  x  in  Y,  e.g.  2  in  {2,2,2,^,8,81  .   Let  Y  and  Z  be  two  integer  sets.   Then 

by  Y  uni  Z  we  get  an  integer  set  where  if  an  integer  x  occurs  i  times  in  Y  and  j 
times  in  Z,  then  x  occurs  i  +  j  times  in  Y  uni  Z,  e.g.  [2,k,k)    uni  (2,^,8)  = 
[2,2,k,k,k,8)  .   Also  if  Y  is  an  integer  set,  then  #(Y)  =  (the  sum  of  values  of 
all  elements  of  Y),  e.g.  #({2,k,k) )  -  10,  and  le  Y  =  (the  value  of  the  largest 
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element  :ln  Y),  e.g.  le  (2,4,4)  =  4.   Furthermore  if  x  in  Y,  then  by  Y  del  {x}  we 
mean  the  integer  set  which  is  obtained  by  deleting  one  occurence  of  x  from  Y,  e.g. 
(2, 4,4,8,  16, 16}  del  (4)  =  (2,4,8,16,16). 

P 

Let  Y, ,  Y„,  ...,  Y  be  integer  sets.   Then  by  M  (d)  Y.  we  mean  the  set 
r     2      p  u     l 

i=l 

Y  constructed  as  follows. 

(1)  Let  Y  =  0  (empty)  and  s  =  d. 

(2)  If  s  =  0,  then  stop  else  go  to  (3). 

(3)  Let  u  -  min(le  Y.)  and  k  be  an  index  of  Y.  such  that  u  =  le  Y,  . 

—  i  i  —  k 

If  there  are  more  than  one  k  which  satisfy  this,  then  pick  an  arbitrary 
one.   Now  let  Y  =  Y  uni  (ul  and  s  =  s  -  u.  Also  let  Y  =  Y,  del  (u)  and  Y.  = 

(Y.  del  {le  Y.)  )  uni  {  le  Y.-ul  for  all  i  (i^k).   Go  to  (2). 

For  example  let  Y±  =   (1,2,2,8)  ,  Y2  =  (4,4,8)  and  Y,  =  {16}  .   Then  Y  = 

3 

•• 

i=l 


(13)  Y.  is  constructed  as  follows, 
u       1 


(1)  Y  -  0  and  s  =  13- 

(2)  le  Y±   =  8,  le  Y  =  8,  le  Y  =  16.   Hence  u  =  8  and  k  =  1.   Then  Y  = 

(8^  and  s  =  13  -  8  =  5-   Also  Y-[  =  (1,2,2),  X^  =   {4,4}  and  Y,  =  (8) 

(3)  le  Y  -  2,  le  Y  =  4,  le  Y,  =  8.   Hence  u  =  2  and  k  =  1.   Then  Y  = 
(2,81  and  s  =  5  -  2  =  3-   Also  Y1  =  (1,2),  Yg  =  (4,2),  Y^  =    {6}  . 

(4)  Repeating  the  procedure  we  finally  get  Y=  (1,2,2,8). 

lefinition  6; 

Let  m  be  an  integer.   Now  write  m  as  a  sum  of  powers  of  2  in  which  each 

power  appears  at  most  once.   Then  by  5b(m)  we  mean  the  integer  set  of  powers  of 
2  which  appear  in  the  above  sum. 


TO 


3    2    0 
For  example  since  13  =  2^+2  +2  =  8+4+1,  I  b(l3)  =  {1,4,8}  but 

Fb  (13)  4  {l,k,k,k}    since  k   appears  more  than  once. 

3.3.2  Holes 

Now  we  discuss  holes  in  an  arithmetic  expression.   Intuitively,  if  an 
arithmetic  expression  A  has  a  hole  of  size  u  then  an  arithmetic  expression  t'  with 
e[t']  <  u  may  be  distributed  over  A  without  increasing  tree  height. 

Definition  7» 

For  each  A  =  7r(t.)  we  define  a  total  hole  function  Hm  and  for  each  t.  in 

A  =  r  t.  we  define  a  relative  hole  function  H^  as  follows: 
1  R 

(7.1)  For  each  A  =  7r(t.),  build  a  tree  T[A].   Then  define  H  [A]  = 

I  2h^  ^"1   for  all  a  e  F  [A]  . 

(7.2)  For  each  A  =  Ft.,  build  a  tree  T[A"|.   Then  for  each  t.  define 

HR[A,t.]  =  T  2h[a]_1   for  all  ae  FR[A,t.].   If  FRrA,t.]  =  0, 

then  let  HJA.t.]  =  0. 
R1    iJ 

As  stated  before  if  a  is  a  free  node  in  T[A],  then  2  "•  ^ "  is  the 
effective  length  of  a  term  t'  whose  tree  T[t']  can  be  attached  to  a  without  chang- 
ing the  tree  height  h[  A] .   Also  let  u  in  Zb(HJA]).   Then  this  implies  that  there 

is  a  free  node  q  in  FA[  A]  such  that  2       =  u  (see  Lemma  k) .   Similarly  if  u  in 
T  b(HjA,  t.])  then  there  is  a  free  node  n   in  FjA,t.]  such  that  2  ^Q'~  =  u.   Thus 

K  1  n  1 

in  general 

(1)      if  A  =  7r(t.  ),    then  h[  (t1  )x  (A)]    =  h[A]    if  m[t']    <  u,    and 
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(2)      if  A  =  2  t.,    then  h[t'+A]    =  h[  A]    if  a[  t '  ]   <  u,    where  u  = 

I       2h[a]"1. 
aeFA[A] 

Definition  8: 

The  set  of  holes  Sh[A]  for  an  arithmetic  expression  A  is  an  integer  set 
defined  as  follows: 

P 

(8.1)  If  A  =     ir  a.,    then  we  let  Sh[A]    =  (Zb(e[A]-p)}  ,    e.g.    Sh[abcde]    = 

i=l  X 

(a(3))  =  (i,2). 

p  p 

(8.2)  If  A  =     7T  (t.),    then  we  let  Sh[  A]    =  yiii(Sh[  t .  ]  )  uni   Z  b(H  [  A] ), 

i=l     1  i=l  it 

e.g.    Sh[  (abc+d)(efg+h)(i+j)]    =   Sh[abc+d]   uni  Sh[  efg+h]    uni   Sh[i+j] 
uni    [Zb(H  [A]))  =    {1}    uni   (1)    uni   Zb(l^)   -   (1,1,2,3,8). 

P 

(8.3)  If  A  =      Z  t.,    then  we  first  let   Sh'[t.J    =   Sh[t.]    uni   Z  b(H_[A,t.]) 

,  i  1        i  - ~ ~~      H    i 

1  =  1 

P 
and  d  =  min  (#(Sh'[  t.  ] ) ) .   Then  we  let  Sh[A]  =  Mu(d)  Sh'[t.]. 
i  i=l 

Example  2 : 

Let  A  =    (a+b)(c+def)    +  ghi   =   (t1)(t2)   +  t,  =  t^  +  t,.      Then  Sh[  t   ]    =   (1} 

and  Sh[(t1)(t    )]    =   (11    uni   I  b(6)   =   {1,2,  k)  .     Also  Sh[  t   ]    =   {1}.      Hence   Sh'ft^]    = 

(1,2,41    and  Sh'[t   ]    =   ( 1)    uni  Z  b(12)    =   [l,h,Q]  .      Now  d  =  min  ( (#Sh'[  t,  ]  ), 

#(Sh'[t3]))  =  7. 

Hence  Sh[A]    =       ^    (7)   Sh'[t.]    =  (1,2,4). 
1-3A 
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In  (8-3)  above  note  that  for  every  i 

(1)  if  u  in  Sh[A],  then  there  is  u.  in  Sh'[t.]  such  that  u  <  u., 

(2)  if  u  in  Sh[A],  then  u  <  #(Sh'[t.]).  Also  there  is  at  least  one 
k  such  that  le  Sh[A]  =  le  Sh'[t  ]. 

q 

Given  an  arithmetic  expression  A,    and  t'    =     £  t.',    if  e[t']    <  le  Sh[A], 

i=l 

then  t'  way  be  distributed  over  A  without  increasing  tree  height.   Informally 

we  say  that  A  has  a  hole  which  can  accomodate  t1. 

Now  we  show  that  the  above  assertion  is  indeed  valid.   First  we  observe 

q 

that  if  t'  =  7T  (t.'),  then  each  t.  '  may  be  distributed  independently  over  A. 
i=l  X  X 

q 

Thus  in  the  following  we  can  assume  that  t*  =  £  t.'.   Note  that  if  t  can  be 

i=l  X 

distributed  over  A  without  increasing  tree  height,  then  any  t'(e[t']  <  e[ t] ) 

can  be  distributed  over  A  as  well. 


Lemma  5: 


P 
Let  A  =     ir  a..      Then  h[A]    =  h[(A)  x    (t*)]    if  e[t']    <  le  Sh[A]. 


i=l 

Proof:   Obvious  by  Theorem  1 


Theorem  2 


p  -  d 

(1)      Let  A  =     7T  (t.).      Then  h[A]    =  h[  (t'A)    ]    if  e[t*]   <  le  Sh[A] 


i-1     x 
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P  _  d  # 

(2)      Let  A  -      It..      Then  h[A]    =  h[(t'A)    ]  if  e[  t '  ]   <  le  Sh[  A] 

i=l  X 


Proof: 

We  use  a  mathematical  induction  to  prove  the  above  theorem.   Lemma  5 
serves  as  a  basis. 

P  P 

Let  A=  7T  (t.)  or  A  =   St..  Assume  that  the  theorem  holds  for  t.. 

.  ,   l         .  -i  i  l 

i=l  i=l 

P 

(1)  A  =   7T  (t  ). 

i=l 

We  show  that  if  e[t']  <  le  Sh[A]  then  A  can  be  multiplied  by  f 
without  increasing  tree  height.   There  are  two  cases: 

(i)  There  is  k  such  that  eft1]  <  le  Sh[  t  ] .   Then  we  dis- 

tribute  t'  over  t,  without  increasing  tree  height. 

(ii)     V     e[t']   >leSh[t   ].      In  this   case  e[  f  ]   <  le   Z  b(H  [A"]  ) . 

Then  there  is  a  free  node  a  in  FA[A1    sucn  that  u  =  2 
and  e[t']   <  u  means  that  T[t']    can  be  attached  to  a  with- 
out  increasing  tree  height. 
Hence  A  can  be  multiplied  by  t'    without  increasing  tree  height. 

P 

(2)  A  =     I  t    . 

i=l 

We  show  that  if  e[t']  <  le  Sh[A]  then  t  can  be  distributed  over  A 
without  increasing  tree  height. 


#  -  d 

We  write  (ft)   for  the  expression  obtained  after  distributing  f  over 


t  such  that  the  tree  height  is  deduced,  e.g.  (a, b+cd)  =  ab  +  acd. 


7^ 


First  note  that   for  all  i,    if  a€F  [A,t.]    then  2h^a^_1  >  le  Sh[t.] 

Now  assume  that  e[t']   <  le  Sh[A].      For  fixed  k,    that  u  =  le  Sh[A] 
implies   either 

( i )     u  <  le  Sh[  tk] 

or  (ii)  there  is  a  in  F_,[A,  t,  ]  such  that  u  <  2  "- a^    (or 

equivalently  u  <  le  Ib(H  [A,t  ])). 

In  the  first  case  we  have  h[  t  ]  =  h[(f  t  )  ]  by  assumption  and 

h[A]  -  h[  7  t  +  (ft  )d]. 
i/k  X  k 

In  the  second  case  we  attach  T[ f ]  to  a   (Figure  2.8(a))  without 

increasing  the  tree  height  h[A],  i.e. 

h[A]  =  h\    T  t  +  (f  )  x  (t  )]. 
i/k  x  * 

In  general  let  1"  be  a  subtree  in  T[A]  whose  root  is  a   (Figure 

3.8(b)).   Then 


\  T* 


(a)  (b) 

Figure  3-8.   Attachment  of  T[  f ]  to  a  Free  Node 
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there  may  be  other  term  trees  besides  T[  t   ]    in  T",    e.g.    T[t.]    and 

T[t   ]    in  Figure  3-8(b).      Hence  we  get, 

h[A]    =  h[        L       t.    +  (f  )   x   (t^+t.+tj^)] 
i#c,j,h 

in  this  case.   Note  that  a  is  also  in  F  [A,  t.]  and  TJA,  t  ]. 

a        j      a        n. 

Repeating  this  procedure  for  all  k  we  can  get  an  arithmetic  expres- 

P 
sion  equivalent  to  X  (t1)  x  (t.)  or  (t')(A)  without  increasing 

i=l        x 

tree  height.   This  proves  (2). 

It  is  obvious  that  in  both  (l)  and  (2)  if  e[t']  >  le  Sh[A],  then  t'  can 

not.  be  distributed  (multiplied)  over  A  without  increasing  tree  height. 

(Q.E.D.) 

Lemma  6: 

Let  A  =   F  t.    and  e[t»]    <  le  Sh[A].      Then  after  t'    is   distributed  over  A, 

we  have 

Sh[  (t'A)d]  =  (Sh[A]  del  (u}  )  uni  T   b(u-e[t']  )  uni  Sh[t'] 
where  u  is  the  smallest  element  in  Sh[A]  bigger  than  e[t']. 

Now  let  us  summarize  what  we  have  so  far.   Let  A  and  t'  be  arithmetic 
expressions  where  A  =  T   t.  and  t'  =  7t'..   If  e[t']  <le  Sh[A],  then  t'  can  be 

distributed  over  A  without  increasing  tree  height,  i.e.  h[(t')  x  (A)]  > 

h[(t*A)   =  hfA].  A  set  of  holes  in  (t'A)   is  given  by  the  above  lemma.   Since  it 
is  obvious  that  le  Sh[  A]  <  e[A],  we  have  the  following  lemma. 
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Lemma  J: 

Let  f    =  Zt.'.    Then  h[(t')(A)]  >  h[(t'A)d]   implies  that  h[t']   <  h[A]. 

q 

In  general  let  t'  =  Ti(t.').   For  convenience  let  us  assume  that  e[t.  '  ] 

<  e[t  ']  <  e[t  ']  <  ...  <  e[t  '].  Then  if  the  following  procedure  can  be 
-    2   -    3   ~     -    q 

accomplished  successfully,  we  say  that  A  has  holes  to  accomodate  all  t..  '  (i=l, 
2,...,q).# 


Procedure: 


i 


(1)  Let  V  =  (ti'(ti_1*,...,(t1,A)d)d...)d  and  VQ  =  A.   Let  i  =  1. 

(2)  Check  if  e[t.']  <  le  Sh[V.  .]. 

i l-l 

(3)  If  so,  then  distribute  t. '  over  V.   ,  and  we  have  V.. 

(k)     Evaluate  Sh[V.]. 

(5)  If  i=q  then  stop,  else  let  i  =  i  +  1  and  go  to  (l). 
The  procedure  may  be  accomplished  successfully  if  m[t']  <  #(Sh[A]). 

3-  3.  3  Space 

Now  the  second  possible  distribution  case,  i.e.  space  is  studied. 

The  idea  of  the  second  distribution  case  is  that  given  an  arithmetic 

expression  D  of  the  following  form 

D  =  ...  9  (f)  x  (A)  +  t  e  ... 

1  s 

distribute  t'  over  A  so  that  t  can  be  hidden  under  the  combined  tree  as  shown 

s 

in  Figure  3.9* 


We  assume  that  each  t. '  does  not  have  any  holes,  i.e.  Sh[t.']  =  0 

for  all  i.   Hence  Sh[(t*A)d]  =  (Sh[A]  del  {  u  })  uni  b(u-e[t'])  for 
example . 
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t'^   t't2 


(a) 


A  A       A     A 


ft 


(b) 


Figure  3.9.  An  Example  of  Space  (l) 


In  other  words,  in  case  of  D,  the  addition  +,  cannot  be  done  before  (f )(A) 
(we  write  f (A))  is  computed  while  in  case  of  D  it  may  be  done  earlier. 

Note  that  if  h[f  (A)]  >  h[(f  A)  ]  then  A  has  enough  holes  to 
accomodate  f  and  the  distribution  of  f  over  A  is  done  anyway.   Henceforth 
throughout  the  rest  of  this  section  we  assume  that  A  does  not  have  any  holes 
to  accomodate  f .   Thus  we  now  deal  with  the  case  when  h[f (A)]  <  h[(f A)  ]. 
However,  if  h[f (A)]  <  h[(f A)  ]  holds,  then  clearly  there  is  no  way  to 
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get  h[J)  ]  <  h[  D]  by  Lemma  1.   Thus  we  have: 


Lemma  8: 


Proof: 


(1)  h[t*  (A)]  =  h[(t*A)d]  must  hold  to  get  h[Dd]  <  h[  D] . 

n  _  d 

(2)  Let  A  =    it..      Then  h[t'(A)]   >  h[(t'A)   ]    if  and  only  if  e[t']   < 

i=l  X 

e[A]    (i.e.   h[tf]   <  h[A]).      This   implies  that  h[t'(A)]   =  h[A]    +  1. 


By  inspection. 

P 
Intuitively  the  space  Sp  in  an  arithmetic  expression  A  =  it.  with 

i=l  1 

respect  to  t'    is   defined  as 

P 
Sp[A,f]    -   e[(A)   x   (t')]    -      I  e[(f)  x   (t    )]. 

i=l 

For  example  let  A  =  ab  +  c  and  t*  =  d.   Then  S  [A, t']  =  2. 


free  node 


(a)  (b) 

Figure  3.10.  An  Example  of  Space  (2) 

Let  D  =  t'(A)  +  t  -  (t'A)   +  t.   Note  that  a  free  node  in  T[t»(A)]  cannot  be  used 

to  attach  T[  t]  while  a  free  node  in  T[(t'A)  ]  may  be  used  to  attach  T[t]. 
Now  the  formal  definition  of  space  follows. 
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Definition  9: 

p  q 

Given  arithmetic   expressions  A  =   Ft.    and  t'    =   Ft.  '  ,    the  space  function 

i=l  i=l 

Sp  of  A  w.r.t.    t'    is   defined  as   follows.      First  we  build  trees  T[A]    and  T[t'],    and 

in  T[A]    let  F  be  a  set  of  free  nodes   f  higher  than  T[t']    (h[f]   >  h[t']).     Also  we 

define  a  set  I  as   follows.      We  let  iel   if  h[t.]   >  h[t']    and  e[  V ]   <  le  Sh[ t. ] . 

Now  the   space   function  is   obtained  as : 

h[f]         h^V 

Sp[A,f]    =      Z  2  L    J    +     T  2 
fe  F  i€  I 

To  show  how  Definition  9  works,    we   first  describe  how  to  build  T[(t'A)    ] 
by  attaching  T[t']    to  T[A]   properly  (i.e.   by  distributing  t'    over  A  properly). 

Since  h[(t'A)    ]    =  h[A]    +1   (Lemma  8(1)),    we   first  study  to  build  T'[A]   which  is 
obtained  by  replacing  T[t.]    in  T[A]   by  T'[t.]   whose  height  is  h[t.]    +  1.      Then  the 

height  of  T'[A]    is  h[A]    +  1.      Building  T[(t'A)    ]    from  T[A]   may  be  explained  in  an 
analogous  way. 

As   stated  before  the  only  case  to  be  considered  is  when  h[t'(Aj]    = 

h[(t'A)d]    =  h[A]    +1  holds    (Lemma  8(1 )).      Suppose  that  all  T[t.]    in  T[A]    are  re- 
placed by  T'[t.]   whose  height  is  h[t.]    +  1.      Then  the  new  tree  T'[A],    whose  height 

is  h[A]    +  1,    is  obtained.      Note  that  a  free  node  a  in  T[A]    now  becomes   a  free  node 
a' in  T'[A]    with  height  h[a]    +  1.      In  T'[A]    if  T'[t.]    is  replaced  by  T[t.]    again, 

then  a  new  free  node  3',    whose  height  is  h[t.]    +1,    is   created.      Having  these 
facts   in  mind,    now  we  describe  the  way  t'    is   distributed  over  A  to  create   space. 

The  tree  T[(t'A)d]    is  built   from  T[A]    as   follows    (note  h[(t'A)d]    = 
h[A]    +1).      Depending  on  the  height  of  T[t.],   we  have  two  cases. 

J 


8o 


(1)  h[t.]  >  h[f  ]. 

If  T[t.]  has  a  hole  to  put  T[t'],   then  we  fill  it  by  T[t'].      In 

J 

this  case  h[t'(t.)]  =  h[t J  and  a  new  free  node  3'  whose  height  is 

h[t.]  +  1  is  created  in  T[(t'A)  ].  Otherwise  h[t'(t.)]  =  h[t.] 
J  J       J 

+  1. 

(2)  h[t.]  <  h[t']. 

J 

Find  the  tree  T[Zt  ]  whose  height  is  h[t']  and  which  includes 
s 

T[t.].   We  multiply  Zt  by  t'  and  get  h[t*(Zt  )]  =  h[Zt  ]  +  1. 
J  s  s        s 

Note  that  t.  and  Zt  are  treated  as  terms  of  A.   In  the  resultant  tree  T[(t*A)  ], 
j       s  J' 

those  free  nodes  in  T[A]  whose  heights  are  less  than  or  equal  to  h[tT]  (i.e. 

free  nodes  in  T[Zt  ] )  do  not  appear, 
s 


(a)  (b) 

Figure  3-H-   Distribution  of  t'  over  A 


(c) 


.m*- 


A  free  node  a  in  T[A]  (h[a]  >  h[t'])  appears  in  T[(t*A)  ]  as  a  free 
node  a'  where  h[a'  ]  =  h[a]  +  1.   Thus  T[(t'A)  ]  has  those  a'  and  3'  described  in  (l) 
as  free  nodes. 


81 


If  6'    is  a  free  node  in  T[  (tfA)d],    then  a  tree  T[t]    (h[t]   <h[6']    -  1) 
can   De  attached  to  T[(t'A)    ]   without  increasing  tree  height    (i.e.   h[(t'A)    ]    = 
h[(t'A)      +  t]).      Since  h[6']    is   either  h[t.]    +  1  or  h[a]    +  1,   we  have  e[t]   = 

J 

2       J     or  2h^]. 

In  general  if  a[t]    <  Sp[A,tf]    then  we  have  h[t'(A)]    =  h[(t'A)d]    = 
h[(t'A)d  +  t]. 

q 

Definition  9  niay  be  generalized  to  include  the  case  where  t'    =  ir(t.'). 

In  this   case  we   first  obtain  h[t']    and  e[t.']    (i=l, 2,  . . .,  q),    and  build  T[A"|. 

In  T[A]    let  F  be  a  set  of  free  nodes   f  which  are  higher  than  h[t']    (h[ f ]    > 
h[t']).      Also  let  I  be  a  set  such  that  i  is   in  I  if  h[t.]    >h[t']    and  T[t.]   has 

enough  holes  to  put  all  T[t.']    (i.e.    a[t']    <  /Sh[t.])).      Then  Sp[A,tf]    = 

feF  iel 

Informally  we  say  that  space  to  put  t  is  created  by  distributing  t' 

over  A  if  h[t'(A)  +  t]  >  h[(t'A)   +  t] .   Then  the  procedure  is  called  space 

filling.   Now  we  study  how  much  we  can  reduce  tree  height  by  space  filling. 

Let  B  =  Z't.')(A.)  +  Zt .  and  assume  that  by  distributing  t. '  over  A.  space  can 
li.  j  l       l 

i  J 

be  created  for  all  i. 


Lemma  9" 

Let  B  =  Z(t.  ')(A.)   +  It.   and  Bd  =  l(t.  'A.  )d  +  Ft..      Then  h[Bd]    =  h[  Bl 
.11  .3  .11  .j  LJ  LJ 

i  0  i  3 

1  if  l!3p[A.,t   ']    >a[B]    -   e[B]/2,    otherwise  h[  Bd]    =  h[  B] . 

i 
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Proof: 

First  note  that  to  lower  the  height  of  a  tree  for  B,  some  terms  must 
be  removed  from  B  so  that  an  effective  length  of  a  resultant  expression  becomes 

e[B]/2.   Hence  ZSp[A. ,t.']  must  be 

greater  than  or  equal  to  a[B]  -  e[B]/2. 

r  dn 

Next  we  show  that  h|_B  J  cannot  be 
smaller  than  h[B]  -  1. 

As  before  we  assume  that 
h[ti*(A.)]=  h[(t.'A.)d]  for  all  i 
(see  lemma  8(1)).   It  is  equivalent  to 


( — erB]/2 
-a[B]  — 


Figure  3-12.   Tree  Height  Reduction 
by  Hole  Creation 


the  assumption  that  Sp[A.,t.']  <  e[t.'(A. )]/2  =  e[(t.'A. )  ]/2.   First  we  get  B! 

from  B  as  follows.   We  replace  every  (t.'A.)   in  B  by  a  product  P.  of 

e[(t  'A  )  ]/2  single  variables.   This  amounts  to  assuming  that  Sp[A. ,t  ']  = 
i  i  ii 

e[(t  'A  )  ]/2.   Futher  we  get  B"  from  B'  by  replacing  every  t.  in  B'  by  a 
i  i  0 

product  Q.  of  e[t.]/2  single  variables.   Then  it  is  clear  that  h[B]  >  h[B  ]  > 
J       J 

h[B']  >  h[B"]  =  h[B]  -  1.   If  ZSp[A.,t.*]  >  a[B]  -  e[B]/2,  then  h[B]  >  h[Bd]. 

Hence  h[Bd]  =  h[B]  -  1. 

(Q.E.D.) 
What  the  lemma  implies  is  that  by  space  filling  we  can  lower  tree 
height  at  most  by  one.   In  other  words  to  see  if  the  space  creation  by  distri- 
bution is  effective  it  is  enough  to  see  if  total  tree  height  can  be  lowered  by 
one,  and  we  know  if  tree  height  is  once  lowered  by  one  it  is  not  necessary 
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(i.e.  useless)  to  try  to  lower  tree  height  further  by  creating  more  space  by 
further  distribution. 

Unlike  a  set  of  holes  (Theorem  2),  the  space  function  for  A  =  2t. 

does  not  carry  any  information  about  space  in  components  of  A,  t..   For  example  let 

B  =  a(b(c+defg)  +  irl6)  +  Trl6  +  irk  =   a(A)  +  tt16  +  irk 
where  Ti  is  a  product  of  i  single  variables.   Then  h[ B]  =  7-   Now  Sh[A]  =  0  and 
space  creation  is  tried.   Note  that  SpfA.a]  =  16  <  a[  B]  -  e[B]/2  =  20,  but 
Sp[  c+defg,  ab]  =  k.   That  is,  a  as  well  as  b  should  be  distributed  over  c+defg. 
Thus  we  c-et  B1  =  abc  +  abcdefg  +  a(7rlo)  +  tt16  +  wh   where  h[B']  =  6. 

Now  this  situation  is  studied  in  detail.   In  general  we  have  a  form: 

F  =  ...  +t'(... +t" ( C ) +. . . +D+. ..)  +  ...  +  E  +  ... 


^ 

J 

V 

A 

V 

J 

w~ 


:-n  if  Sp[A,t']  is  not  enough  to  reduce  the  tree  height  h[  F] ,  we  have  to  further 
check  components  of  A,  e.g.  Sp[C,t't"].   As  we  will  show  later  (see  Substep  2  of 
Step  i\   of  Algorithm  given  in  Section  j.k-.l)   an  arithmetic  expression  is  examined 
from  the  inner  most  pairs  of  parentheses  to  the  outer  most  pair.   In  the  above 
diagram,  the  distribution  of  t"  over  C  is  first  checked  to  see  if  it  reduces  the 
tree  height  h[A]  and  then  the  distribution  of  t'  over  A  is  examined.   If  the 

itribution  of  t"  over  C  creates  space  and  reduces  the  tree  height  h[A],  then 
there  is  no  problem.   However  if  that  distribution  does  not  lower  the  tree  height 

.],  then  t"  will  not  be  distributed  over  C  (see  Algorithm).   As  we  showed  in  the 
above  example  when  we  check  the  possibility  of  reducing  the  tree  height  h[ F]  by 
creating  space  by  the  distribution  of  t'  over  A,  it  may  be  necessary  to  check 
,  ft"]  as  well. 


Qk 


Let  A'  =  ...  +  (t"C)d  +  ...  +  D  +  ...,  G*  =  (t'A)d  and  G"  =  (t'A')d- 
Now  we  show  that  if  Sp[A, t']  =  0,  then  it  is  not  necessary  to  examine  components 
of  A,  e.g.  Sp[C,t't"]  further.   This  helps  to  reduce  the  number  of  checks 
required. 

Lemma  10 : 

(1)  If  Sp[A, t']  =0,  then  h[...+G"+...+E+. ..]  <  h[...+G'+. ..+E+. ..] 

never  holds. 

(2)  If  Sp[A,t']  =  0,  then  h[G"]  <  h[G']  never  holds. 


Proof: 


(1)  We  prove  this  by  showing  that  if  Sp[A,t']  =  0  then  Sp[A',t']  =  0. 
Note  that  Sp[A, t']  =  0  implies  that  either 
(i)   h[t']  >  h[t"(C)] 
or  (ii)  h[tn]  =  h[C]  (Definition  9). 


By  Lemma  9,    in  either  case  we  get  h[(t'(t"C)  )  ]  >  h[t't"(C)]. 

Note  that  the  only  difference  between  G'  and  G"  is  that  a  term 

-— sr  d 

t't"(C)  in  G'  is  being  replaced  by  (t'(t"C)  )   in  G".   Since 


h[(t'(t"C)d)  ]  >  h[t't"(C)],  G"  cannot  have  more  free  nodes 
than  G'.   Hence  Sp[Af,t']  =  0. 
(2)  This  may  be  proved  in  a  similar  way  and  the  details  are  omitted. 

(Q.E.D.) 
Thus  in  F  =  ...  +  t' ( . . .+t" (C)+. . .+D+. . . )  +  ...  +  E  ...  the  distri- 
bution of  t"  over  C  should  be  done  if  it  reduces  the  tree  height  of  T[A], 
otherwise  it  should  be  left  untouched.   In  the  latter  step  when  the  distribution 
of  t'  over  A  is  examined,  the  possibility  of  distributing  t"  over  C  as  well 
shall  be  checked  if  and  only  if  Sp[A, t']  /  0.   Otherwise  we  shall  leave  t'(A) 
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as  it  is  and  we  need  not  check  inside  of  A  again. 

3.4  Algorithm 

Having  these  results,  an  algorithm  to  reduce  tree  height  of  an 
arithmetic  expression  is  now  described.   Given  an  arithmetic  expression,  the 
algorithm  works  from  the  inner  most  pairs  of  parentheses  to  the  outer  most  pair. 
We  assume  that  cases  (P,  )  and  (P  )  (see  Lemma  2)  are  already  taken  care  of.   At 

each  level  of  a  parenthesis  pair,  first  upon  finding  a  form  t'(A),  a  hole  of  A 
is  tried  to  be  filled  by  t'  (Theorem  2).   After  all  holes  are  filled,  a  form 
t'(A)  +  t"  is  checked,  i.e.  if  the  distribution  of  t'  over  A  creates  enough  space 
to  accomodate  t",  then  the  distribution  is  made.   Note  that  it  is  not  necessary 
to  fully  distribute  (A)(b)  =  (lt.)(Xt.T)  (see  Lemma  3)- 

It  is  not  necessarily  true  that  reducing  tree  height  of  a  term  t 
of  an  arithmetic  expression  A  reduces  tree  height  of  A.   However,  we  show 
that  reduction  of  tree  height  should  be  made  in  any  case  to  help  later  steps  of 
the  distribution  algorithm. 

n-1  n-1 

Let  A  =  Tt.    +  t      (or  ir  (t .  )  x  (t  ) ) .   Assume  that  the  distribution 
.  ,  1    n     .  ,   1      n 
i=l  i=l 

somehow  reduced  the  height  of  T[  t  ],  i.e.  the  distribution  algorithm 

n-1  n-1 

reduced  A  to  A'   =      I  t.    +  t   '    (or     it  (t.  )  x  (t    '))    where  h[  t   ]   >  h[  t    '].      Also 

.  ,  1    n      .  n   i7     n  ; '  LnJ    LnJ 

i=l  i=l 

assume  that  MA]  =  h[A'].   Yet  it  is  obvious  that  #(Sh[  A])  <  #(Sh[  A'  ] )  (i.e.  le 

£hr;.l  <  le  Sh[A'])  and  also  for  any  t",  Sp[A,t"]  <  Sh[A,t"].   That  is,  even  if 

distribution  only  reduces  the  tree  height  h[ t  ]  and .does  not  reduce  the  tree 

height  h[A],  that  distribution  does  not  cause  any  bad  effect  on  the  later  steps 

when  A  appears  as  a  term  of  a  bigger  expression  d  with  respect  to  holes  and  spaces , 
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The  arithmetic  expression  thus  obtained  may  give  the  lowest  tree 
height,  i.e.  the  fewest  number  of  computation  time  steps. 

3.k.l     Distribution  Algorithm 

In  the  steps  below  we  refer  to  the  notation: 

. k  -  1 k  -  1 

I    . k 1         I    r—  k — r     i k ! 1 


«(...«(   k     )&..  .)©...  ©(...©(     )e...e(  )©...)© 

'   s-l,i   I  I   sj   I     I   s,j+l  I 


H —  tl  — h     k << 

A    =  ..  .  +  t   .  ,     7T  (t  -  )  +  ... 


k       p 
where  t   .  ,  =  tt  a„  or  empty. 
s,J-l   l=1  t 


Step  1: 


Go  to  an  arithmetic  expression  enclosed  in  an  innermost  parenthesis 


pair  which  is  not  checked.   Let  this  level  be  k-1.   In  the  above  diagram  we  are 

«k-l 
now  working  on,  say,  A 

s 


Step  2: 

Obtain  a  set  of  holes  for  all  t  .  which  are  enclosed  in  the  k-th 

k-1 
parenthesis  pair  and  are  components  of  A   ,  as  well  as  their  heights  h  and 


effective  lengths  e 
Step  3: 


k-1 
In  this  step,  the  (k-l)th  parenthesis  pair  level  A  ""  is  examined. 
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Substep  1:  Hole  filling  (see  Theorem  2) 
Let 

0+n 


k       P 


t 


where  t      =  ir   a  or  empty.  Also  without  loss  of  generality  we  assume 

that  e[tg£]  <  e[tg  i+J]   for  l  =   j,  j+1,  ...,  j+n-1. 

k-1 

(1)  Find  an  occurrence  of  a  form  7r(t.)  or  Tra.  x  7r(t.)  in  A   .If  there 

J       i      0      s 

is  no  such  occurrence,  then  go  to  Substep  2.   If  an  occurence  of  a 
form  ir(t.)  is  found,  then  skip  (2)  and  (3).   Otherwise  go  to  (2). 

k-1 

(2)  Suppose  we  find  t  in  A    as  an  instance  of  Tra.  x  Tr(t.). 

s  1       j 

k  k 

Fill  holes  in  Sh[t  ]  (h=j, j+1, . . ., j+n)  using  a.'s  in  t   .   If  there 

are  many  holes  to  be  filled  in,  fill  the  smallest  ones  first,  i.e. 
in  order  of  increasing  size.   Reevaluate  Sh  by  Lemma  8  for  those 

t  .  whose  holes  are  filled, 
sh 

(3)  If  t  ,  (h=j,  j+1,  . . .,  j'+n)  do  not  have  enough  holes  to  accomodate  all 
a.'s  then  go  back  to  (l)  to  find  out  another  occurrence  of  Tr(t.)  or 

j+n   k 

Tra.  x  Trft.)  form.   Otherwise  we  work  on  tt  (t1  ,  )  which  we  get  from 
i      j  .  .   sh'         & 


P      J+n  v 
ira,  x  tt  (t\ )  after  (2). 
1-1  J   h=i   Sh 

(J+)  We  start  from  h  =  j.   Check  if  t'   can  fill  in  one  of  holes  in 

u  sh 
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Sh[t'   ]  (l=h+l, . . .,  j+n).   If  there  are  many  holes  which  can 
s  x. 

]^ 
accomodate  t'   ,  fill  them  in  order  of  increasing  size.   Continue 

the  procedure  until  all  t'   (h=j+l,  . . .,  j+n-l)  are  put  in  some  holes 

or  there  is  no  hole  to  accomodate  t'   .   Go  back  to  (l)  to  find 

sn 

out  another  occurrence  of  7r(t.)  or  ira..   x  7r(t.)  form. 

J       i      J 

Substep  2:   Space  filling 

k-1  k 

After  Substep  1,  we  again  check  A   ,  where  all  holes  in  t  .  have 

s  sj 

been  filled  in  as  much  as  possible  by  Substep  1. 

(1)  Let  Ex  =   a[Ak_1]    -   e[Ak_1]/2    (see   Lemma  9)- 

(2)  Let 

k-1         k       ^+n-1  k  n     k 
A   "=...+t   .  -,  x  it     (t,)x(t   .,)+... 

s  S'J_1    h=j   S     cf'  Jt^ 

~Yf  t 

k       p  k 

where  t   .  n  =  tt  a,  or  empty.   We  also  assume  that  eft   .   1  is 
B,d-1   !=i  *  s,j+nJ 

k  k 

the  largest  among  all  e[t  ,]  (h=j, . . ., j+n) .   Let  t'  =  t   .  .  x 

sn  sf  j -± 

j+n-1     .  .  . 

tt_    (tgh).      If  h[t']    «h[t*         ],    then  evaluate   Sp[t*  ^  f  ] . 

Otherwise  leave  it  as  it  is. 

(3)  Repeat  (2)  for  every  occurrence  of  a  form  tt(  t . )  (or  ira..   x  ir(t .)) 

J        -'-J 

k-1 
in  A   .   Assume  that  there  are  m  such  occurrences.  Arrange  all 
s 

~p[t,t']  in  order  of  decreasing  size.   For  convenience  we  write 

Spr  Sp2,  ...,  SpJSp.  >  sP.+1). 
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m  d-1  d 

(+)   If   Z  Sp.  >  Ex  then  let  d  be  such  that  Z  Sp.  <  Ex  and  Z  Sp  >  Ex. 

i=l  x  i=l  x        i=l 

k      ^+n_1   k      k 

(5)  Let  t   -tX  it       (t>.)x('t   •   )  be  a  form  which  corresponds  to 

s,j-±  '   ._.    sn      s,  j+n 

1 2 ' 

t'  t 

Sp.(i  <  d).   Then  distribute  t'  over  t,  and  create  space  Sp.. 

Repeat  the  same  procedure  for  all  i  =  i,  2f    . ..,  d. 

m 

(6)  In  the  case  where  enough  space  to  accomodate  Ex  (i.e.  Z  Sp.  <  Ex) 

i=l  x 

]^ 
is  not  found,  a  check  is  made  against  the  component  terms  of  t 

(see  Lemma  11). 

For  example  let 

t   .-=a.art..«a,n=l 
s,j-l    12      p' 

and  tk  .  =  bnb0  ...  b  (t^1)  +  Z   t*!"1. 
s,j    12      qv  sf     i=1  si 

Then 


Ak-1       .  Jc      ,,.  ,.      ,.  ^k+lN  .  ™  , 
s 


..  +t*   _  x  (bnb0  ...  b  (t"Ix)  +   lO  +  .. 
s,  j-1     12      q  sf     .  _  si 

i=l 

i/f 


k+1 
Then  the  distribution  is  done  if  the  sum  of  Sp.  and  Sp[t  „  , 

t   .  _  x  bn  . . .  b  '  is  greater  than  or  equal  to  Ex.   Here  the  dis- 
s,J-l    1      qJ 

k+1 
tribution  of  a,  ...  a  b,  ...  b  over  Z  t  .'  is  to  be  made  as  well 
1      pi      q        si 

k+1 
as  the  distribution  of  a,  ...  a  over  t  _  .   This  checking  is  to 

1      p       sf 
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be  made  until  enough  space  to  accomodate  Ex  is  found  or  else 
until  the  innermost  level  of  parenthesis  pair  is  reached. 


Step  k; 


k-1 
Mark  A    as  checked, 
s 


Step  5: 


If  all  levels  are  checked,  then  stop,  otherwise  go  back  to  Step  1. 


For  example  let  us  consider  the  following 
A  =  ...  +  a1a2a5(t1)(t2)(t5)  +  

t 
Further  assume  that 

Sh[tx]  =  Ul,  e[t2]   =  16 

Sh[t2]  =  (16,21,  e[t2]  =  61+ 

and 

Sh[t3]  ,  0    e[t3]  =  64. 

Then  a, apa,  can  be  distributed  over  t,,  and  in  turn  this  whole  thing  can  be  distri- 


buted over  t, 


. .  +  a..  a2a-,  { t,  4 


o 


(t0)(tj  + ... 


and  we  get  h[t']    =  7  whereas  h[t]    =  8. 
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3.^-2  Implementation 

A  few  words  about  implementation  of  the  algorithm  described  above  are 
given  as  well  as  the  total  number  of  checks  required  to  process  an  arithmetic 
expression.   Suppose  we  are  given  the  aritmetic  expression 

A  =  ...  +  (7r24^rl)(7rlO+(7r4-Pn-l)(7rl7-H7r2)-P7r3)  +  ... 

=  ...  +  (d1+d2)(e11+(e2i+e22^e31+e32^  +  el+l^  +  "•• 
=  ...  +  (D)(E1+(E2)(E3)  +  E^)  +  ... 

=  ...  +  (D)(E)  +  .  .  . 
where  iri   represents  a  product  of  i  single  variables.   Then  we  build  nested  stacks 
as  shown  in  Figure  3.13(a).   Note  that  a  new  stack  is  created  for  each  form  ir(t.) 

or  Z   t . . 

1 


e21(x) 

B(+) 

eu(x) 

E2(+) 

(+) 

TVT      fv\ 

1 

im^x; 

"2KAJ 

«= 

e00(x) 

A  <— 

^— 

<— 

>> 

E?(+) 

D(+) 

\ 

I 

\llXJ 

~'~~~-e^n 

d.(x) 

31 

d2(x) 

^^e32 

level 


- 


m-k 


m-3 


m-2 


m-1 


m 


Figure  3-15-   Stacks  for  an  Arithmetic  Expression 
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(b) 


E   4- 


w2(x; 


Sh 


A 


_L 


1,2,  If,  8, 
8,16,32 


t,5,6 


E2(+) 


Sh 


if  a 


iii 


Sh 


A 


1,2,4,8 


2,3,4,5 


2, 3  ,4,  5 


e21(x) 


h 

2 

Sh 

0 

FA 

0 

e22(x) 


/"       e31(x) 


e„(x) 


h 

1 

Sh 

0 

FA 

0 

Nl  «■ 


(c) 


Sh 


A 


El£L 


2,4 


3,^,5 


3,4 


"MJ. 

N2'(x) 

h 

5 

Sh 

1,2,4,6 

FA 

1,2,3,5 

/* 

e3i 

N2"(x) 


Sh 


e21(x) 


E   (  +  ) 


h 

3 

Sh 

0 

FA 

1,2 

FE 

0 

<- 

1,2 

'22 


Figure  3- 13 •      (continued) 
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Each  stack  is  assigned  a  level  number  (cf.  Definition  3)  where  the  first  stack 
which  corresponds  to  A  receives  the  level  number  0  (Figure  3« 12(a)). 

We  start  working  from  a  stack  with  the  largest  level  number,  say  m.   For 
each  stack  t,  where  t  =  It.  or  t  =  ir(t.),  h[t]  is  evaluated.  Also  if  a  stack 

represents  a  form  t  =  It.,  then  Sh[t],  F  [t]  and  F  [t,  t.]  are  evaluated.   If  a 

stack  represents  a  form  t  =  7r(t.  ),  then  Sh[t]  and  F  [t]  are  evaluated.   These 

values  are  obtained  by  Definitions  1,  h-   and  8.   Note  that  this  information  is 
sufficient  to  evaluate  Sp.   Figure  3.12(b)  gives  an  illustration. 

Upon  finding  a  form  7r(t.)  (or  Tra .  x  7r(t.))  (e.g.  the  stack  N ),  we  apply 

the  distribution  algorithm  and  decide  if  distribution  is  to  be  made.   If  a  stack 
represents  a  form  t  =  7r(t.),  then  Substep  1  of  the  distribution  algorithm,  i.e. 

hole  filling,  is  tried.   Otherwise  a  stack  represents  a  form  t  =  I  t.  and  Substep 

2  of  the  distribution  algorithm,  i.e.  space  filling  is  applied.   In  our  example  E 

is  distributed  over  E,(e[Ep]  <  le  Sh[E,]).   Then  stacks  are  revised  as  shown  in 

Figure  3 .12(c).   Note  that  the  stack  N  is  replaced  by  two  new  stacks  N  '  and  N  ", 

and  the  stack  E  disappears. 

If  all  stacks  with  the  level  number  k  have  been  checked,  then  stacks  with 
the  level  number  k-1  will  be  checked.  In  our  example,  stacks  E(or  E'  since  it  has 
been  revised)  and  D  are  now  checked. 

The  total  number  of  checks  required  to  process  a  whole  arithmetic 
expression  thus  depends  on  the  number  of  parenthesis  occurrences  in  it.   Assume 
that  there  are  p  parenthesis  pairs  in  an  arithmetic  expression  A.   For  each  pair, 
space  creation  should  be  examined.   Hence  in  total  p  space  creation  checks  are 
required.  Now  for  each  ir(t.)  form  hole  filling  should  be  tried.   The  number  of 
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occurrences  of  a  form  ir(t. )  in  A  is  obviously  less  than  p.   Hence  the  total  number 
of  checks  required  is  less  than  2p  (i.e.  the  order  of  p). 

3.5  Discussion 

3.5.1  The  Height  of  a  Tree 

Given  a  tree  for  an  arithmetic  expression,  the  distribution 
algorithm  tries  to  lower  tree  height  by  distribution  if  possible.  However, 
in  general  it  may  not  give  the  minimum  tree  height.   For  example  let  A  =  ac  +  ad  + 
be  +  bd  whose  tree  height  is  3>  and  since  no  further  distribution  is  possible, 
the  distribution  algorithm  yields  the  same  value.   There  is,  however,  an  equivalent 
expression  A'  =  (a+b)(c+d),  whose  tree  height  is  lower  than  3,  i.e.  2.   That  is, 
even  though  factorization  lowers  tree  height  sometimes,  the  distribution  algorithm 
does  not  take  care  of  it. 

The  question  we  ask  now  is  how  much  the  distribution  algorithm  lowers 
tree  height.   Before  giving  an  answer  to  this  question  let  us  study  tree  height 
in  more  detail.   Given  an  arithmetic  expression,  Theorem  1  gives  the  exact  height 
of  a  tree  obtained  by  Bovet  and  Baer's  algorithm.   It  is  also  desirable  if  we  can 
get  an  approximate  tree  height  without  actually  building  a  tree  for  an  arithmetic 
expression.   Since  the  number  of  single  variable  occurrences  (the  number  one  less 
than  this  gives  the  number  of  operators  in  an  arithmetic  expression)  and  the 
depth  of  parenthesis  nesting  may  well  represent  the  complexity  of  arithmetic 
expressions,  let  us  try  to  approximate  tree  height  in  terms  of  them. 

Let  A  be  an  arithmetic  expression  with  n  single  variable  occurrences 
and  depth  d  of  parenthesis  nesting.   Now  build  a  tree  for  A  by  Bovet  and  Baer's 
algorithm.   Then  it  can  be  proved  that: 
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jemma  11: 

log2rnl2  <  h[A]    <  n  -   1. 

Moreover  we  can  prove  the  following  theorem. 

Theorem  3 • 

h[A]   <  1  +  2d  +  log[nl2. 

The  following  lemma  is  helpful  to  prove  Theorem  3< 


Lemma  12 


(1)     2a>ra], 


(2)      f2al2  =  2fal2 


(3)      log 


r    P 

r p 

-[ml 
i=l 

<  log (2    • 

2 

r  m 
1=1 

i   2 


Proof: 


(l)   and  (2)   are  obvious   and   (3)   can  be  proved  by   (l)   and   (2) 


(Q.E.D.) 


Proof  of  Theorem  2: 

Proof  is  given  by  induction  on  d.   First  let  us  prove  the  theorem  for 
d  =  0-   Then  A  has  the  following  pattern: 

A  -  F   it  a.. 
i=l  j=l  r 

Then  by  Theorem  1, 

h[A]  =  log    Z  T  q  1 

i=l 
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<  log  2 


r  p 


by  Lemma  5(3) 


=  1  +  log[n]2. 
Nov  assume  that  the  theorem  holds  for  d  <  f . 

Let  t.  be  an  arithmetic  expression  with  depth  d.(<  f)  parenthesis 

J  J 

i  i         ^ 

nesting  and  n.  single  variable  occurences.   Then  by  assumption  h[t.]  <  1  +  2d't 

J  ■  J         J 

+  log[n_. lp-   Now  an  arithmetic  expression  with  f  +  1  parenthesis  nesting  can  be 

built  from  t.  as  follows: 

(q.     m. 
n.  .   i  . 
ir  a'  tt  (tih 

j=l  3  k=l  K 

where  a.  are  single  variables  and  at  least  one  of  t.  has  f  nested  parentheses. 

Now  each  t.  can  be  replaced  with  a  product  of  e[t.]  single  variables  without 

J  J 

affecting  the  total  tree  height.   Instead  of  using  the  value  e[t.],  let  us  use 


2d' 


the  value  2-2      J    .    Tn1^.      (Note  that  h[t^]    <  1  +  2d1   +  logrn1!      =   log (2   • 


2d1.  .  .  h[tX]  2d* 

2      J    •    TrulJ    and  eft1]    =2        J      <  2    •    2     J    •    rn*"L.)      Since   d1.   <  f,    e[t*]    < 

2f     i 
2-2     fn  1    •      Now  from  Theorem  1,    we  have 
J   2 


h[A]    <  log 


r  p 


TCI. 


12+      Z    2-S2frnj12l2 

3=1 


r  P 


m. 


<  log  2 


s2f  i- 


2    [q     +     r    2   •    2     nt] 

3=1  J 


1-= 
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|2.2.2.22f    r  (q,  +  rnbl 


<  logl2-2-2-2 

=  1  +  2(f+l)  +  logrnl2. 
Thus  the  theorem  holds  for  d  =  f  +  1  and  this  proves  the  theorem. 

(Q.E.D.) 

Now  let  us  examine  the  original  question  i.e.  how  effective  is  the 

algorithm  presented  in  this  chapter.   Let  A  and  A  be  arithmetic  expressions 

d 
where  A  is  the  resultant  expression  obtained  from  A  after  the  application  of 

the  distribution  algorithm.   Now  build  trees  for  A  and  A  by  Bovet  and  Baer's 

algorithm.   Then  it  should  be  clear  that  h[A]  >  h[A  ].   Moreover  experience 
suggests  that: 

Conjecture: 

h[Ad]  ^  2  log2rnl2 

where  n  is  the  number  of  single  variable  occurrences  in  the  original  arithmetic 
expression  A. 

Note  that  the  distribution  algorithm  speeds  up  a  Horner's  rule 
polynomial  in  a  logarithmic  way.  Also  note  that  the  distribution  algorithm 
does  no  distributions  in  the  case  of 


which  takes  21ogr nl   steps  as  it  is  presented  but  which  would  take  (n+l)  logrn] 

steps  if  fully  distributed.   Thus  the  algorithm  can  save  a  factor  of  n/2  steps 
over  a  scheme  which  would  distribute  indiscriminately  and  in  some  cases  achieves 
a  logarithmic  speed  up. 
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3.5.2  Introduction  of  Other  Operators 

3.5.2.1  Subtraction  and  Division 

Subtractions  can  be  introduced  into  an  arithmetic  expression  without 
causing  any  effect  on  the  distribution  algorithm.   It  may  be  necessary  to 
change  operators  to  build  a  minimum  height  tree.   For  example  let  A  =  a  +  b  - 
c  +  d.   This  will  be  computed  as  A  =  a  +  b  -  (c-d): 


Divisions  may  require  special  treatment  since  the  distributive  law  does  not  hold 
in  certain  cases,  e.g.  (a+b+c)/d  =  a/d  +  b/d  +  c/d  but  a/(b+c+d)  /  a/b  +  a/c  + 
a/d.   Hence  in  general  minimization  of  the  height  of  trees  for  a  numerator  and 
a  denominator  is  tried  independently,  and  then  distribution  of  a  denominator 
over  a  numerator  is  tried  if  appropriate.   Also  let  A  =  t/t'.   Then  T[A]  is 
built  from  T[t]  and  T[ t ' ]  as  follows: 


T[t], 

or 


If  h[t]  /  h[t'],  then  we  get  nodes  to  which  only  one  tree  is  attached,  e.g. 
a  and  (3  if  h[t]  <  h[  t'],  and  a'  and  3*  if  h[t]  >  h[t']i   Then  a  and  p  are 
treated  as  free  nodes  in  T[A]  while  a'    and  3'  are  not  treated  as  free  nodes  in 
T[A],  because  later  when  another  expression,  say  t",  is  multiplied  to  A,  t  must 
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be  multiplied  by  t"  not  t',  i.e.  t"(A)  =  t"(t/t')  =  (t"t)/t'  ^  t/(t't**): 


T[t" 


T[f] 


3- 5-2.2   Relational  Operators 

If  an  arithmetic  expression  A  contains  relational  operators  e.g. 
B  I  )  C  -where  RO  =  [>.  <,  =,  >,    <,  .  ••},  trees  can  be  built  for  B  and  C 
independently: 


A  = 


T[B] 


T[C] 


If  h[B]    -*'  h[C],    then  operators  may  be  moved  from  one   side   to  the  other  to 
balance   two  trees.      For  example   let  A=a+b+c>d.      Then  we  modify  this  as 
A'    =   a  +  b  >  d  -   c  and  get  h[A']    =2  while  h[A]    =   3: 


b       d 
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k.      COMPLETE  PROGRAM  HANDLING 

Chapter  3  presented  the  algorithm  which  reduced  tree  height  for  a 
single  arithmetic  expression  by  distributing  multiplications  over  additions 
properly.   In  this  chapter  we  will  discuss  some  ideas  about  how  to  handle 
complete  programs,  i.e.  given  one  program,  how  can  it  be  executed  in  the 
shortest  time  by  building  a  tree  as  well  as  executing  a  statement  in  a  for 
statement  simultaneously  for  all  index  values.   Ideas  include  back  substitution. 
We  do  not  have  the  solution  to  the  problem,  but  this  chapter  presents  some 
details  of  the  problem  and  some  ways  to  attack  them. 

We  conclude  this  chapter  by  comparing  serial  and  parallel  computations 
in  terms  of  a  generated  error.   It  is  shown  that  in  general  we  could  expect  less 
error  from  parallel  computation  than  serial  computation.   It  is  also  shown  that 
distribution  would  not  increase  the  size  of  an  error  significantly. 

4.1  Back  Substitution  -  A  Block  of  Assignment  Statements  and  an  Iteration 

While  the  distribution  algorithm  in  the  previous  chapter  discusses 
tree  height  reduction  for  a  single  arithmetic  expression,  it  can  be  used  for 
any  jump  free  block  of  assignment  statements.   If  we  define  those  variables  which 
appear  only  on  the  right  hand  sides  of  assignment  statements  or  in  read  statements 
in  a  block  as  inputs  to  the  block,  and  those  variables  which  appear  only  on  the 
left  hand  sides  of  assignment  statements  or  in  write  statements  as  outputs  from 
the  block,  then  we  can  rewrite  the  block  with  one  assignment  statement  per 
output  by  substitution  of  assignment  statements  into  one  another.   For  example 


101 


a  :=  b  +  c; 

d  :=  e  x  f; 

g  :=  a  +  c; 

h  :=  a  +  g  x  d 
can  be  rewritten  as 

g  :=  (b+c)  +  c; 

h  :=  (b+c)  +  ((b+c)  +  c)  x  (exf ) . 
After  such  a  reduction  only  input  variables  appear  on  the  right  hand  sides  of 
assignment  statements.  At  this  point,  the  distribution  algorithm  could  be 
applied  to  each  remaining  assignment  statement  and  if  sufficient  computer 
resources  were  available,  all  of  the  reduced  assignment  statements  could  be 
executed  at  once-   In  the  above  example  if  each  statement  is  computed  in 
parallel  (by  building  a  tree)  independently  then  5  steps  are  required,  while 
if  the  back  substitution  is  done  then  the  computation  requies  only  +  steps. 
Suppose  we  have  assignment  statements,  A  , A  ,  ...,A  .  Also  suppose  that  by 

back  substitution  we  can  rewrite  this  block  as  A.   We  build  minimum  height 
trees  for  A, ,A  , ...,A  and  A.   Now  we  apply  the  distribution  algorithm  on  those 

trees.   Let  the  resultant  tree  heights  be  h|  A  ],..., h[A  ]  and  h[A] .   Then 

obviously  hrA-,]  +  ...  +  hi" A  ]  >  h[A],  i.e.  back  substitution  never  increases 
the  computation  time  (in  the  sense  of  tree  height)  (Figure  l+.l). 

Our  main  interest  here  is  the  case  where  strict  inequality  in  the 
above  relation  holds,  because  that  h[A, ]  +  ...  +  h[A  ]  >  h[A]  holds  is 

equivalent  to  a  speed  up  of  the  computation  by  back' substitution.   Note  that 
back  substitution  amounts  to  symbol  manipulation  (i.e.  replacement)  and  should 
not  be  confused  with  arithmetic  simplification.   For  example  from 
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w 


V 


h[An] 


Figure  k.l.     A  Back  Substituted  Tree 


a  :=  x  +  y 

b  :=  a  +  y 
we  get 

b  :=  (x+y)  +  y 
orb  :=x+y+y 
but  we  do  not  get 

b  :=  x  +  2y  . 
Now  we  shall  study  this  kind  of  speed  up. 

We  shall  discuss  a  limited  class  of  assignment  statements,  i.e.  an 
iteration.   This  may  serve  to  give  some  insight  to  the  problem  of  speed  up  by 
back  substitution  in  the  general  assignment  statement  case. 

By  an  iteration  we  mean  a  statement 

'i  ■  f(yi-i'- 

Usually  a  statement  is  executed  repeatedly  for  1  =  1, 2,'. .  .,n.  An  example  is: 
for  I  :=  1  step  1  until  10  do 
A[I]  :=  A[I-1]  +  A[I]; 
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Also  a  block  of  assignment  statements  such  as: 
S1:  a  :=  h  +  i  +  j; 

S?:  b  :-■  a  +  k  +  m; 

S,:   c  :=  b  +  n  +  p; 

S.  :   d  :=  c  +  q  +  r; 

falls  into  this  category  (note  that  all  statements  have  a  form 
output  of  S.  :  =  output  of  S.    +  x  +  y 

where  x  and  y  are  pure  inputs  in  the  sense  that  they  do  not  appear  as  outputs). 
Assume  that  we  are  only  interested  in  the  value  of  y  (the  other 

results,  i.e.  y  ,,-y  rt,  . . . ,  y,  may  be  obtained  similarly  to  y  but  in  less 

n-1   n-2        1  n 

time).   Then  instead  of  n  statements,  i.e.  y,  =  f(..),  yp  =  f(..),...,y  = 

f(..),  we  n^ay  obtain  one  statement  for  y  by  back  substitution.   For  example, 

let  y.  =  a.   y.   .   Then  y  =  a  ny  ,  =  a  .(a  ^y  _)  =  a  .(a  _(a  ,y  _)) 
1    l-l l-l        n    n-l^n-1    n-1  n-2Jn-2     n-1  n-2  n-3  n-3 

n-1 

=  •  •  •  -  y0  tt  a,  .   We  use  the  superscript  "b"  to  distinguish  the  back 

b      n_1 
substituted  form  from  an  iteration  form,  e.g.  y  =  a  ,y  .  and  y  =  y^  tt  a.  . 

Jn         n-1  n-1      n    0  .  _  k 

k=0 

Then  instead  of  computing  each  y.  repeatedly  for  i  =  1, 2,  . .,n,  y  may  be 

computed  directly.   In  the  above  example  y.  can  be  computed  in  one  step*  and 

to  get  y  n  steps  are  required  while  y  can  be  computed  in  r log  nl  steps  in 

parallel  (i.e.  by  building  a  tree  for  y  ).   The  following  table  summarizes  the 
results  for  some  primitive  yet  typical  iteration  formulas. 
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yi 
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yn 
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P 
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n 
a  v~ 

1 

n 

ri0g2(n  +  ljl 

yi.i  +  b 

n 

yQ  +*b  +"/."+  bA 

1 

n 

flog2(n  +  1? 

ai-iyi-i 

n-1 

Vo 
k=0  K  u 

1 

n 

log^ 

y    -,  +  a 
°i-l        i-1 

n-1 

z  \  +  yo 

k=0  K 

1 

n 

flog2n1 

ay._x  +  b 

*\  +  Pn-l(a^ 

2 

2n 

*2fiog2(n  +  1? 

ayi-l  +  Xi-1 

.    .    ■*-* 
p;(a) 

2 

2n 

ar2flog2(n  +  1)1 

yi-l  +  bxi-l 

n-1 

z  bxk  +  yo 

k=0     K         u 

2 

2n 

«  'log  nl   +  1 

2 

ayi-l  +  bxi-l 

p"   a) 
*n 

3 

3n 

-2riog2(n  +  1)1 

/•    \       -l.   n-1       ,    n-2 
p     -  (a)   =  ba         +  ba         +   . . .    +  b 


„ .  . /    \  n        ,      n-1        ,      n-2 

**     Pn    (a)   =   a  y0   +  a       X0   +  a       Xl  +   "•    +  "n-2   +  Xn-1 

„v.j,  ti  /    \  n  ,    n— 1  ,    n-2  , 

***     p      (a)   =   a  y     +  ba       x^  +  ba       at,    +  . . .    +  bax     n  +  bx     . 

n  J0  0  1  n-2  n-1 

T    :      The  time  required  to  compute  y.    in  parallel,    i.e.   h[y.]. 
T    :      The  time  required  to  compute  y     in  parallel,    i.e.   h[  y  ]. 


Table  k.l.     Comparison  of  Back  Substituted,   y,    and  Non-Back  Substituted 
Computation,   y. — Iteration  Formulas 
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From  Table  *4-.l,  the  following  lemma  is  obtained  by  exhaustion. 

Lemma  1: 

Let  y.  =  f (y.  , )  be  linear  in  y.  .  where  we  assume  that  in  the 

presented  form  additions  are  reduced  to  multiplications  as  much  as  possible, 
e.g.  y.  =  2yi_1  instead  of  y±   =  yi_1  +  yi_1-   Then  n  x  h[y.,]  >  h[ynl- 

Thus  if  we  have  enough  FE's,  then  instead  of  computing  each  y. 

repeatedly  for  i  =  1, 2,  ...,n,  we  should  obtain  y  by  back  substitution  and 

compute  it  by  building  a  minimum  height  tree. 

If  an  iteration  y.  =  f(y.  , )  is  not  linear  in  y.  _,  e.g.  y.  =  a  y?  , 

+  b  y^   +  c,  or  if  it  is  linear  in  y.   but  there  are  some  additions  not  being 

reduced  to  multiplications,  e.g.  y.  =  y.  ,  +  a  y.   ,  then  it  is  not  clear  if 

back  substitution  speeds  it  up.   For  example,  back  substitution  does  not  speed 
up  the  computation  of  y±   =  yi_1  +  yi_1  +  y±_1  +  Y±_1'     Also  let  yi  =  f(yi_1) 

be  a  polynomial  of  y.  ,  where  in  the  presented  form  additions  are  again  reduced 

to  multiplications  as  much  as  possible.   Then  it  is  not  likely  that  we  can 

speed  up  the  computation  by  back  substitution.   Let 

„/    \       m 

f (y.  ,  )  =  a  y.    .    +  ... 

where  y.   is  the  highest  power  of  y.  ,  among  those  which  appear  in  f(y.   ). 
Note  that  f (y.   )  is  not  necessarily  a  dense  polynomial  (a  polynomial  in  which 

all  powers  of  y.   ,  i.e.  y.  ,,  y.  ,,  ...,  y.  .,  appear).   While  the  exact  height 
of  T[ f (y.  , )]  depends  on  f(y.   ),  we  may  content  ourselves  with  (see  Chapter  2) 
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h[f(yi-l)]  ~  2riog2m1' 


Hence  2nriog?ml  steps  are  required  to  compute  y  . 

Now  let  us  consider  y  .   Then 

n 


b  ,  b       m 


n         m     n-l 


m  m 

=  a   (a   (y     0)     +  . . . )     + 
mv   m  wn-2 


m         m 

=   a   (a   ( . .  .a  yn. . . )    . . . )      +   ... 

mm  nrO        ' 

i      n 
n-l  m 
=  am     yQ     +   ...    . 

That  is,    y     becomes   a  polynomial  of  y     of  degree  m  .      Leaving  the  computation 

of  a         out  of  consideration,    we  have   (see  Chapter  2) 

h[y£]   *  2[-log2mnl 

*  2nr  logpinl  . 

Hence  back  substitution  does  not  help  to  speed  up  computation  significantly 
in  this  case. 

To  gain  a  better  understanding  of  more  general  cases,  let  us  study 
the  situation  from  a  different  point  of  view.   Given  an  iteration  y.  =  f (y.  , ), 

let  us  consider  the  number  of  single  variable  occurrences  in  y  as  a  measure  of 

e  Jn 

the  complexity.  We  study  two  cases  separately,  i.e.  (i)  y.  ..  appears  only  once 
in  y.  and  (ii)  y.   appears  k  times  in  y. .   In  both  cases  we  assume  that  there 
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are  m  single  variable  occurrences  (including  the  occurrences  of  y.   )  in  y. . 

For  convenience  we  write  N(y)  for  the  number  of  single  variable  occurrences  in 
y,  e.g.  N(a+b+cd4a-e)  =  6. 

(l)  y.  ,  appears  only  once  in  y. .   In  this  case  we  have 

N(y1)  =  m 

N(yg)  =  N(y;L)  +  m  -  1  =  2m  -  1 


N(y^)  =  N(y^)  +  m  -  1  =  2m  -  2 


N(y  )  =  N(y  ,)+m-l  =  mn-n  +  l~  ran. 
Jn  n-1 

(2)  y.    appears  k  times  in  y. .   In  this  case  we  have 
N(y1)  =  m 

N(y^)  -  k  •  N(y^)  +  m  -  k  =  m  +  k(m-l) 

N(y^)  =  k  ■  N(Vg)  +  m  -  k  =  m  +  (k+k2)(m-l) 

•  •  • 

N(y^)  -  k  ■  N(y^_1)  +  m  -  k 

=  m  +  (k  +  . . .  +k~  )(m-l) 

=  1  +^-i(m-l). 

If  kn  »  1  and  m  »  1,  then  N(yb)  s  kn_1m. 

>  wn 

Now  if  we  use  21"logpN(y)l  as  a  measure  of  the  height  of  a  tree,  then 
we  have  (see   Section  3- 5*1  of  Chapter  2): 
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h[y±] 

n  x  HV±] 

«# 

(1)          2flog2m] 

2n[ log2ml 

2flog2(mn)l  *  2(riog2m]    +  rioggii"!) 

(2)          2riog2ml 

2nf log  ml 

2riog2(kn"1m)l  «  2((n-l)riog2kl    +  floggml) 

Table  k.2.      Comparison  of  Back  Substituted,  y  ,  and  Non-Back  Substituted 
Computation,  y^ — General  Cases 

For  example  let  m  =  5,  k  =  2  and  n  =  20  in  (2).   Then  we  have 
n  .  hfy.l  =  k0    •    T  log  51  =120 

and  h[y£]  =  2(l9flog22l  +  flog^l)  =  kk. 

Also  if  we  let  m  =  5  and  n  =  20  in  (l),  then  we  get 
n  •  h[y.]  =  i+oriog25l  =  120 

and  h[y*]  =  2(riog25l  +  Tlog^Ol)  =  16. 

Now  a  few  comments  about  implementation  are  in  order.  As  for  back 
substitution  of  a  block  of  assigment  statements,  the  step  by  step  substitution 
is  the  only  possible  scheme.   In  case  of  an  iteration  formula,  we  may  use  the 

z-transformation  technique  to  obtain  y  [8].   For  example  let  y.  =  y.  n  +  x. . 
Then  by  applying  z-transformation  on  it,  we  get  Y(z)  =  zY(z)  +  X(z)  or  Y(z)  = 


^Z^  .   Hence  Y(z)  =  X(z)(l  +  z  +  z2  +  7?   +  .  .  .  ) 

2  2 

=  (x   +  X-.Z  +  X  z   +  . . .  )  (l  +  z  +  z   +...) 


=  xQ  +  (x1  +  xQ)  z  +  (x2  +  x±   +   XQ) 


z   +  .  .  . 
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or  y  =  Z    \> 

1   k=0  * 

Two  other  related  problems  become  evident  in  the  example  presented 
above.   First  is  algebraic  simplification.   For  example,  a  :=  b  +  kc   could  be 
executed  more  quickly  than  a  :=  b+c+c+c+c  We  shall  not  discuss  this  subject 
further  here.  A  second  problem  is  the  discovery  of  common  subexpressions.   In 
our  example,  (b+c)  appears  twice  in  the  right  hand  side  of  it. 

If  we  had  an  algorithm,  e.g.  [11]  which  discovered  common  sub- 
expressions in  one  (or  more)  tree  which  could  be  simultaneously  evaluated,  the 
number  of  FE's  required  could  be  reduced  by  evaluating  the  common  subexpressions 
once  for  all  occurrences.   On  the  other  hand,  by  removing  common  subexpressions 
the  execution  time  (the  height  of  a  tree)  may  be  increased  in  some  cases.   For 
example,  if  we  have  x  :=  a(b+c+de)  and  y  :=  f(g+c+de),  then  we  might  try  to 
replace  c  +  de  in  x  and  y  by  z  as  follows  to  save  the  number  of  PE's  required: 

x 


However,  note  that  x  =  a(b+z)  or  y  =  f(g+z)  takes  k   steps  while  the  original  x 
and  y  require  only  3  steps,  i.e.  h[a(b+c+de)]  =  3  and  h[a(b+z)]  =  k.      Thus  an 
overall  strategy  must  be  developed  for  the  use  of  a  common  subexpression 
discovery  algorithm  in  conjunction  with  overall  tree  height  reduction. 
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h . 2  Loops 

This  section  is  included  here  to  complete  this  chapter,  and  discusses 
the  subject  superficially.   Details  will  be  presented  in  the  following  chapters. 

Consider  the  following  example. 
El:  for  I  :=  1  step  1  until  10  do 
for  J  :=  1  step  1  until  10  do 
S3:  A[I,J1  :=  A[I,J-1]  +  B[j]; 
In  this  case  ten  statements,  A[l,  J]  :=  A[1,J-1]  +  B[J],  A[2,  J]  :=  A[2,  J-l]  + 
B[J],  ...,  A[10,  J]  :=  A[10,J-1]  +  B[J]  can  be  computed  simultaneously  while  J 
takes  values  1,2, ...,10  sequentially.  We  say  that  S3  can  be  computed  in  parallel 
with  respect  to  I.   Note  that  originally  the  computation  of  El  takes  100  steps . 
(One  step  corresponds  to  the  computation  of  S3,  i.e.  addition.   For  the  sake  of 
brevity  we  only  take  arithmetic  operations  into  account  and  shall  not  concern 
with  e.g.  operations  involved  in  indexing.)  By  computing  S3  simultaneously  for 
all  values  of  l(l  =  1,2, . ..,10)  the  computation  time  can  be  reduced  to  10  steps . 

b  10 

Finally  by  building  a  minimum  height  tree  for  A  [1,10] (:=  A[I,0]  +  Z  B[J]) 

J=l 

for  each  I  (i  =  1,2,..., 10),  we  can  compute  all  ten  trees  simultaneously  in 
h_   steps. 

To  help  understanding,  let  us  further  consider 
L:  for  I  :=  1  step  1  until  N  do 

for  J  :=  1  step  1  until  N  do  S; 
Then  Figure  ^.2 (a)  shows  the  execution  of  L  as  it  is  presented.   The  total 
computation  time  required  (t)  is  N,  x  N  x  m,  where  we  assume  that  m  arithmetic 

operators  are  in  S.   Now  suppose  S  can  be  computed  in  parallel  with  respect  to 


Ill 


I,  (Figure  4.2(b)).   In  Figure  4.2(b),  each  box  has  the  form  shown  in  Figure 

lj-.2(c).   Here  S       11  computed  sequentially,  i.e.  T.  =  mNp.   Now  let  us 

compute  S  in  parallel  i.e.  by  building  a  tree  (Figure  4.2(d)).   Then  we  have 
TQ  =  N0h[S].   Note  that  m  >  hJ"S].   Further  if  we  back  substitute  S  for  J  = 

1,2 N  and  get  S  ,  we  have  T  =  h[ S  ].  As  stated  before  (Section  4.1), 

N2h[S]  >  h[Sb],  or  TQ  >  T±   >  Tg  >  T  . 

general  we  have 
L:   for  I   :=  1  step  1  until  N,  do 

for  I   : =  1  step  1  until  N  do 


for  i   :=  1  step  1  until  N  do  S; 

n      *-   n  — 

re  S  is  an  assignment  statement.   Then  the  computation  of  L  takes  T  = 

n 

7T  N.  m  steps  as  it  is  presented,  where  we  assume  that  m  arithmetic  operations 
] 

are  involved  in  the  computation  of  S.   If  S  can  be  computed  for  all  values  of 

I  =  1,2, ...,N.  )  simultaneously,  then  the  computation  time  can  be  reduced  to 

K 


T,  ~     N.lm)  stens,  i.e.  N.  statements  can  be  computed  simultaneously 

, I  , ...,I  _,I,  .,..., I  change  sequentially.   In  general  there  are  n 

possibilities,  i.e.  we  examine  if  S  can  be  computed  in  parallel  with  respect 
to  I  for  k  -•  1,2,  ...,n.   Let  P  =  [k|o  can  be  computed  in  parallel  with  respect 

to  I,  .   hen  we  would  compute  3  in  parallel  with  respect  to  I  where  Ta  = 
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<Z5> 


I=N 


C3> 


(a) 


(b) 


J: =J+1 


J 


T 
m 


<^> 


-^  ^J+l  ) 


h[S] 


3 


h[sb] 


(c) 


(d) 


3 
(e) 


Figure  k.2.      Loop  Analysis 
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min  T  .   Clearly  each  statement  of  the  resultant  N  statements  can  be  computed 
kcF  k  g 

by  building  a  tree.   Further  if  it  is  appropriate  we  perform  back  substitution 

and  obtain  a  big  tree  as  the  above  example  (El)  illustrates. 

If  a  loop  is  a  limit  loop  which  terminates  when  e  <  6  for  some  pre- 
determine! B  and  computed  c,    it  may  be  approximated  by  a  counting  loop  (e.g. 
or  T  :-  1  is  - er  1  until  II  do)  which  is  executed  a  fixed  number  of  times  before 

"est  is  made,  and  then  repeated  if  the  test  fails. 

Consider  a  program  containing  n  two  way  forward  jump  statements  (or 

if  statements).   Let  the  tests  for  jumps  be  Boolean  expressions  B„,B  ,...,B  . 
—  1  2'     n 

Assume  that  there  are  m  output  variables  from  the  program  given  as  expressions 

,A_,  ...,A  ,  where  parts  of  A.  may  depend  on  B.'s.   In  a  program  when  B.  is 
1'  2'   '  nr  l  j  j 

encountered,  one  of  two  choices  is  taken  depending  on  the  value  of  B . .   It  is 

possible  to  start  computing  all  of  these  possible  alternatives  at  the  earliest 
time,  and  choose  proper  results  as  soon  as  values  of  B.'s  become  available. 

For  example 

a  :=  g  +  c; 

B  :   a  >  ':): 

if  B  then  d  :=  e  +  f  +  s  else  d  :=  a  +  g  +  t; 

:  i  +  f  +  i  x  j  x  k  x  p  x  q; 

yield 

:=  B  x  (e+f+s)  +  (not  B)  x  ((g+c)+g+t)  +f+ixjxkxpxq 
or 
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h  :=  ((g+c)  >  0)  x  (e+f+s)  +  ((g+c)  <  0)  x  ((g+c)+g+t)  +  f  +  i  x  j 
x  k  x  p  x  q, 
where  we  let  B  =  1  for  true,  B  =  0  otherwise.   Then  we  may  build  a  tree  for  h 
as  follows . 


gcefsg     eg     t      f       ijkp      q 
Figure  k.J.     A  Tree  with  a  Boolean  Expression 

The  box  \B  •  produces  0  or  1  depending  on  the  value  of  (g+c>0). 

In  general,  Boolean  expressions  can  be  embedded  in  arithmetic 
expressions  as  shown  in  the  above  example,  and  a  minimum  height  tree  can  be 
built  for  it. 

h.k     Error  Analysis 


In  this  section  parallel  and  serial  computation  are  compared  in  terms 
of  error.   We  are  only  concerned  with  a  generated  error,  i.e.  an  error  which  is 
introduced  as  a  result  of  arithmetic  operations.   It  is  shown  that  in  general 
parallel  computation  would  produce  less  error  than  serial  computation.   It 
is  also  shown  that  distribution  would  not  increase  the  size  of  an  error 
significantly.   Let  co  represent  any  arithmetic  operation.   In  general,  we  do 
not  perform  the  operation  co  exactly  but  rather  a  pseudo-operation  (jx>)  .   Hence 
instead  of  obtaining  the  result  xo^y,  we  obtain  a  result  x(a?)y.  We  may  write 
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x0y  =  (^y)(i+e  )  (1) 

where  <:   represr  I   in  error  introduced  by  performing  a  pseudo-operation.   For 

example,  we  have 

-:  0  y  =  (x+y)(l-f€a) 

and 

x  Q  y  =  Cxy)(l+em). 

Let  us  write  A  for  an  approximation  to  an  arithmetic  expression  A  with  an 
error  obtained  by  computing  A  using  pseudo  operations,  e.g.  fy\    or  (+)    .   Then 
^an  also  be  written  as 
(xuoy)   =  (xo>y)(l+e  ). 

Now  let  us  consider  the  computation 

A  =   la.. 
i=l  X 

First  we  compute  A  serially,  i.e. 

A  =   . . .f((a1+a2)+a,)+a^)+...+aN). 


.e  have 
* 


=  ai  0  a2  =  (a1+a2)(l+e&) 

-  &     +   a2  +  e  (a»j+a  ), 

{+)   a  =  (a  +a  +e  (a  +a,,)+a  )(l+€  ) 

J  3  -L^SlJ-.^       3, 


=  a,  +  a,-  +  a,  *■   e  (2a., +2a^+a-. ) . 
1    2    3    a   1   2  3 


•  higher  terms  of  e   are  neglected. 

St 
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#         * 


\  =  A3  ©  %  =  al  +  a2   +  a3  +  \  +  ea^ai+5a2+2a3+ai+), 


\  =  \-l   O   *N  =    ._\ai   +  ea((N-l)a1+(N-l)a2+(N-2)a3+...+aN) 


N 


2  a.    +  €    (Na  +(N-l)a  +(N-2)a,+. .  .+0 
i=l 


We  let  E  =  e  (Na  +(N-l)a  +(N-2)a  +. . . +a  ) .   Next  let  us  compute  A  in  parallel, 
i.e.  by  building  a  tree: 


A 


Without  loss  of  generality  we  assume  that  N  is  a  power  of  2.   Then 
AI2  ■  al  ©  a2  =  al  +  a2  +  ea(Va2) 
1-k   =  A12  ©  Alk   =  al  +  a2  +  a3  +  %  +  2ca(al+Va34V 
1-8  =  Al-ii  0  A5-8  =  al  +  a2  +  •"  +  a8  +  3ea(a1+a2+...+a8) 


A 


A 


Al-N  = 


l-N/2   O  V+l-N  "   J/i    +  ri0g2N1    £a    .f^i 
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We  let  -r    =  r log  Nl  €    £  a. .   To  compare  E  with  E  ,  let  a  =  a  =  a0  =  ...  =  a 


Then  we  get 


N' 


and 


a     2         a 


::rlog2Nl  a  •  ea, 

or 

ES  >  EP. 
a    a 

N 
An  error  for  B  =  it   b.  can  be  analyzed  in  a  similar  manner.   In  this  case  we 
i=l  ^ 

N 

ES  --  EP  =  (N-l)  e   Trb. 
m    m         m  .  ,  i 

i=l 

Hence,  Ln  general,  we  could  expect  that  parallel  computation  produces  less 

error  than  serial  computation. 

* 
hat  if  higher  terms  of  e  and  e  are  neglected,  then  A  can  be 

a      m  ' 

written  as 

A  +  i     E  (A)  +  (   v  E  (A) 

a    a       mm 

where       an  I   ''A)  are  arithmetic  expressions  consisting  of  variables  in  A. 

For  example,  if  we  compute  A  =  afbc+d)  serially  (i.e.  A  =  a((bc)+d))  then  we 
get 
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A     =    (ax((bxc)(l+€m)+d)(l-f€    ))(l+€    ) 


~  a(bc-d)    +  e   a(bc+d)  +  €    (2abc+ad), 
am  " 


and  E    (A)   =  a(bc+d) 


and  E   (A)    =   2abc+ad. 


m 


Usually  E    (A)   and  E   (A)    depend  on  how  A  is   computed  as  we  have  shown  for 

A  --   E  a.. 

Now  let  us  compare  parallel  computation  of  two  arithmetic  expressions 

A  and  A  ,  where  A  is  the  resultant  expression  obtained  by  applying  the 

distribution  algorithm  on  A,  in  terms  of  a  generated  error.  Note  that  we  can 

write 

A  =  A  +  e   x  E  (A)  +  e  x  E  (A) 
a    a      m    m 

and 

Ad*  =  Ad  +  e  x  E  (Ad)  +  e   x  E  (Ad) 
a    a   '  m    nr 

=  A  +  g   x  E  (Ad)  +  e  x  E  (Ad). 
a    a       m    m 

/  n  d 

As   an  example   let  us   study  A  =  a(bc+dj    +  e  and  A     =   abc   +  ad  +  e. 


bed  e 

(a)     A  =   a(bc+d)    +  e 


(b)     A     =  abc   +  ad  +  e 


Figure  k.k.      Trees  for  a(bc+d)  +  e  and  abc  +  ad  +  e 
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Then  we  have 
* 


A     =    (a(bc(l+€    )+d)(l+e    )(l+€    )+e)(l+€    ) 
m  a  nr      ' v        ay 

=   (a(bc+d)+e)+e    (2abc+2ad+e)+e    (2abc+ad) 
a  m. 


and 


Ad*  =    (abc(l+€    )2   +  (ad(l+e    )+e)(l+e    ))(l+€    ) 
m  m  &'  J  a 

=    (abc+ad+e)   +  e    (abc+2ad+2e)    +  e    (2abc+ad). 

Note  that  E  (A)  =  E  (A  )  in  the  above  example,  which  is  not  mere  chance.  We 
can  show  that  this  holds  for  all  cases. 


Lemma  2 : 


E  (A)   =  E  (Ad) 
m       m 


Proof : 


where 


First  let  us  consider 

t#a  \   ©  *2  ©  ••'  ©  *n 


t*   =  t.  +  €  E  (t.)  +  e  E  (t. ). 

1    i    aai     mmi 


Then  clearly 

n 

E  (t)  =  €    E  E  (t.) 

m      m  .  .,  m  l 
i=l 

regardless  of  the  order  of  additions  whereas  E  (t)  depends  on  the  order  of 

additions.   Hence  we  may  write 

*         n  n 

t  =  Z   t.  +  e  E  (t J  +  e    ZE  (t.) 
.  .,  i    a  a  Z  m  .  n  m  i 

i=l  1=1 

where  Z  indicates  that  E  (t)  depends  on  the  order  of  additions. 
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and 


Now  let  us  consider 

A*  =   t*   0    (tj*    ©   t*    ©    ...    ©    t*) 


A       =  t     0   tx    ©  t     ©   t2    ©    ...    ©   t     0    tn 


where 


and 


t     =   t   +  e   E    (t)    +  e   E    (t) 
a  a  mm 


t.      =  t.    +  e   E   (t.  )   +  e   E   (t. ). 
i  i  a  a     l  m  irr    i' 


Then  we  have 


n 


n 


and 


Hence 


A     =    (t-K E(t)+e    E(t))(    Z  t    4*    E    (t    )+€         Z  E(t ,)(l4€ni)) 

aa           mm           .    .,    i  a  a     i.      m..mi           m 

1=1  1=1 

n                      n  n 

=  t      St.    +  e    (    Z  (t.E    (t))  +  tE    (t    ))+  €    (    Z  (tE    (t    )+t  E   (t)+tt.)) 

.    ,i         a   .    ,      i  a  a     a           m.,miim             i 

i=l                  i=l  1=1 


A        =    (t4€    E    (t)+€    E    (t))(t14€    E    (tJ+€    E    (t1))(l+€    )    (+)    ... 
a  a  mm  1     a  a     1       mm     1  m     v-/ 

=    (ttn    +  e    (tE    (tJ+tnE    (t))    +  e    (tE    (tJ+t.E    (t)+ttn))    (+) 
v      1         a       a     1       1  a     '  m       m     1       1  nr  1"    v-^ 


t   Z  t,    +  e   E    (a£)    +  €    (    Z  (tE    (t    )+t ,E    (t)+tt    )) 
.    ,    l  a  a     L  m   .    ,        mi        lm  l 

i=l  1=1 


E(A)   =  E    (Au). 
m  m 


(Q.E.D.) 
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As  for  E  (A)  and  E  (A  ),  they  depend  on  the  order  of  additions  and  cannot  be 

compared  simply.   However,  they  may  not  differ  significantly.  As  a  simplified 

case,  let  us  study  the  following: 

N 
A  =  t(  £  a. ) 

i=l  X 

and 

H      N 

A  =   ^  (ta.). 
i=l   X 

Again  we  assume  that  N  is  a  power  of  two.   Then  to  compute  A,  we  first  compute 

:;  n   ^   n  n 

T.   a.  in  parallel.  As  we  showed  before,  (La..)     =     L   a.  +  riog_Nle_  L   a.. 
i=l  i=l      i=l  i=l 

Hence 

*  N  N 

A     =   t(    L  a . +r log^Nl e  Za.)(l+e    ) 

.    ,    i  d       a  .   _    i  m 

i=l  i=l 

N  N  N 

=   t   L  a.    +  e    r  log  Nit   Z  a.    +£    t   Z  a.  . 
.    ,    l  a  .    _    i         m  .    -    l 

i=l  i=l  i=l 

On  the   other  hand,    we  have 

A.    =    (ta. )      =   ta.    +  e   ta. 
11  l  mi 

d+ 
and  A   is  obtained  by  summing  A.  in  parallel,  i.e. 

Ad*  =  (..(((a*  ©  a2)  ©  (A;  ©  a*))  ©  ((A;  ©  $  ©  (a;  © 

Ag)))...) 


N                N  N   . 

=  t  Z  a.  +  e  riog^Nlt  Z  a.  +  e  t  Z  a.  . 

.,i    a   &2   •-,!  m  .  .  i 

i=l             i=l  i=l 


Hence  in  this  case  E  (A)  =  E  (Aa)  as  well  as  E  (A)  =  E  (A  ). 

a       a  mm 
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5-   PARALLELISM  BETWEEN  STATEMENTS 

This  chapter  should  be  read  as  an  introduction  to  the  following 
chapter  which  discusses  loops  in  a  program.   In  this  chapter  we  study 
parallelism  between  statements,  i.e.  inter- statement  parallelism.   Given  a 
loop  and  jump  free  sequence  of  statements  (we  call  this  a  program),  it  is 
expected  that  they  are  executed  according  to  the  given  (i.e.  presented)  order. 
However  if  two  statements  do  not  depend  on  each  other,  they  may  be  executed 
simultaneously  in  hopes  of  reducing  the  total  computation  time.   In  general, 
statements  in  a  program  may  be  executed  in  any  order  other  than  the  given 
order  as  long  as  they  produce  the  same  results  as  they  will  produce  when  they 
are  executed  in  accordance  with  the  given  sequence.   In  this  chapter  we  give 
an  algorithm  which  checks  if  the  execution  of  statements  in  a  program  by  some 
sequence  gives  the  same  results  as  the  execution  of  statements  by  the  given 
sequence  does.   Also  a  technique  which  exploits  more  parallelism  between 
statements  by  introducing  temporary  locations  is  introduced. 

5-1   Program 

A  program  P  with  a  memory  M  is  a  sequence  of  assignment  statements 
S(i),  i.e.   P -=  (S(l);  S(2);  ...;  S(i);  ...;  S(r))  where  i  is  a  statement 
number  and  r  is  the  length  of  a  program  P  (we  write  r  =  lg(P)).   The  memory 
M  is  a  set  of  all  variables  (or  identifiers)  which  appear  in  P. 

Associated  with  each  S(i)  is  a  set  of  input  variables,  IN(S(i))  and 
an  output  variable,  OUT(S(i)).   Then  M  =.U,  (lN(S(i))  UOUT(S(i))).   Further 
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we  define  two  regions  in  a  memory;  a  primary  input  region  M  and  a  final 

output  region  M  as 

Mj.  =  (m  |  meIN(S(i))  and  Vk  <  i,  m/OUT(S(k) )} . 

and  M  =  {m  |  meOUT(S(i)  )1 . 

A  program  uses  the  values  of  those  variables  in  VL.   as  primary  input  data  and 

puts  final  results  into  M  . 

C(m)  refers  to  a  content  (value)  of  a  variable  m.   C(M)  refers  to 
the  contents  of  variables  in  the  memory  M  as  a  whole  and  is  called  a  config- 
uration of  M.  Also  CT(m)  refers  to  a  value  which  m  has  before  a  computation 

(i.e.  an  initial  value  of  m).   Thus  C(M  )  refers  to  primary  input  data  given 

to  a  program.  We  call  it  an  initial  configuration. 

The  following  relations  are  established  among  statements  in  P. 
A  triple  (id,  i,  j)  where  id  e  M  (id  for  an  identifier)  and  i,  j  e 
{0,1, . . .  ,r,r+l]  (r  =  lg(P))  is  in  the  dependence  relation  DR(P)  if  and  only  if: 
(1)   (i)  i  <  j  and 

(ii)      id  e    0UT(S(i))   and  id  e   IN(S(j))   and 
(iii)     Vk,    i  <  k  <  j,    id  ft   0UT(S(k)), 
or   (2)      (i)      i  =  0  and 

(ii)      id  e   IN(sCj))   and 
(iii)     Vk,    0  <  k  <  j,    id  I  0UT(S(k)), 

(s(,j)  is  the  first  statement  to  use  id), 
or  (3)   (i)  $   =  r  +  1  and 

(ii)  id  e  0UT(S(i))  and 
(iii)  Vk,  i  <  k  <  r  +  1,  id  /  0UT(S(k)). 


3.214- 


only  if: 


(S(i)  is  the  last  statement  to  update  id). 
Similarly  a  triple  (id,  i, j)  is  in  the  locking  relation  LR(P)  if  and 

(i)  i  <  j  and 

(ii)  id  e  IN(S(i))  and  id  e  OUT(S(j))  and 
(iii)  Vk,  i  <  k  <  j,  id  /  CUT(s(k)). 

Example  1  (The  notations  follow  ALGOL  60|  3] ) : 
Let  P  be 


S(l):  a 

I:  d 

):  f 

):  g 


=  b  +  c; 

-  a  +  e; 
=  g  +  d; 

-  h  +  i. 


Tli  en 


DE(P)  =  {(b,0,l),(c,0,l),(e,0,2),(g,0,3),(h,0,i|),(e,0,^),(a,l,2), 

(d,2,5),(f,3,5),(g,^,5))  and 
LR(P)  =  [(g,3,^)). 

Since  we  are  only  interested  in  meaningful  programs,  we  assume  that 
there  is  no  superfluous  statement,  i.e.  there  is  no  id  e  M  such  that 

(i)  id  e  OUT(S(i))  and  id  e  OUT(S(j))  where  i  <  j,  and 
(ii)  Vk,  i  <  k  <  j,  id  /  IN(S(k)). 
Also  v/e  assume  that  there  is  no  statement  that  has  no  inputs  other  than  constant 
numbers,  e.g.  "a  :=  5" • 

Now  we  define  an  execution  order  E  of  a  program  P  as : 
E(P)  =  {(i,j)|ie{l,2,...,lg(P)},je(l,2,...}}. 
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JL 

We  also  write  E  (i)  =  j  if  (i, j)  e  E(T).W        To  execute  a  program  by  E(P)  means 

that  at  step  j,  all  statements  with  statement  number  E_  (j)  are  computed 

simultaneously  using  data  available  before  the  j-th  computation  as  inputs. 
A  pair  (P,  E)  is  used  to  denote  this  execution.   Also  by  En(P),  we  understand 

the  execution  order  given  by  a  program,  i.e. 

E  (P)  =  ((i,i)  |  Vi  e  [l,2,...,lg(P))}. 
En  is  called  a  primitive  execution  order. 

We  assume  that  at  each  time  step  at  least  one  statement  of  P  must 
be  executed.   That  is,  for  any  E  there  is  k  such  that  Vj  >  k,  E   (j)  =  empty 

and  Vj  <   k,  E   (j)  /  empty.  We  call  k  the  length  of  an  execution  and  write 
lg(E). 

As  stated  before,  C(OUT(S(i)))  refers  to  the  contents  of  a  variable 
OUT(S(i)).   This  value,  as  we  expect,  varies  from  time  to  time  throughout  an 
execution.   Thus  it  is  essential  to  specify  the  time  when  a  variable  is  referred 
to. 

S(i)(P,E)  refers  to  a  computation  of  S(i)  of  P  in  an  execution  (P,E). 
C(m)  after  S(i)(P,E)  refers  to  the  value  of  a  variable  m  right  after  S(i)(P,E). 
C(m)  after  (P, E)  refers  to  the  value  of  a  variable  m  after  an  execution  of  a 
whole  program. 

5-2  Equivalent  Relations  Between  Executions 

Now  we  define  two  equivalent  relations  between  executions. 


J 

For  convenience  we  define  that  Vi,  E_(0)  <  E_(i)  and  E-^i)  < 


Ep(lg(P)+l) 
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Definition  1; 

Given  a  program  P  and  two  execution  orders  E  and  E  ,    (P,E,  )  and 

(P, E_)  (or  simply  E..  and  E?)  are  said  to  be  output  equivalent  if  and  only  if: 

for  all  initial  memory  configurations  C_(M_), 

Vi  C(OUT(S(i)))  after  s(i)(P,E1)  =  C(OUT(S(i)))  after  S(i)(P,Eg). 

We  write  (P,E  )~(P,E  )  if  (P,E.)  is  output  equivalent  to  (P,E  ). 

Definition  2: 

Given  two  programs  P  and  P  ,  let  their  execution  orders  be  E,  and 

Ep  respectively.  Also  let  their  memories  be  M,  and  Mp.   Then  two  executions 

(P  ,E,)  and  (P0,E  )  are  .said  to  be  memory  equivalent  if  and  only  if: 

(l)  there  is  a  one-to-one  function 

fi<"ll  U  V  -  lMSI  U  «20> 

such  that 

f(M  )   =  M 

1  II     21 

and  f(M1Q)  =  M2Q, 

and  (2)  for  all  initial  memory  configuration  pairs  C_(M-]_j)  and  CT(M?T) 

such  that 

Vm  €  Uir    CjU)  =  CI(f(m)). 

Vn  €  M1Q, 


C(n)  after  (P^E^  =  C(f(n))  after  (P2,E2). 


M 


We  write  (P  ,E,)~(P  ,E  )  if  (P-,E,  )  is  memory  equivalent  to  (P  ,E p). 
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In  principle,  a  program  is  written  assuming  that  it  will  be  executed 
sequentially,  i.e.  by  EQ.   It,  however,  need  not  necessarily  be  executed  by 
E~  as  long  as  it  produces  the  same  results  as  (P,En)  when  it  terminates,  i.e. 


(V 


lM, 


it  may  be  executed  by  any  E  as  long  as  (P,E)=(P,E_)  holds. 

Now  the  following  theorems  can  be  proved  directly  from  the  above 
definitions. 


Theorem  1 


(P,E)=(P,E  )  if  and  only  if: 

(1)  Vi,  (id, i,.j)eDR  implies  that  E  (i)  <  E  (j). 
and  (2)  for  any  two  triples  (id,  i,  j)  and  (id,  i',  j ' )  in  DR  with  the 
same  identifier  id,  either  Ep(j')  <  E  (i)  or  E  (j)  <  E  (i') 

holds. 


What  condition  (l)  implies  is  that  variables  must  be  properly  up- 
dated before  used,  and  condition  (2)  prevents  variables  from  being  updated 
before  they  are  used  by  all  pertinent  statements. 


(a)  Condition  (l) 


«p(i)  0 

A 

\ 

Ep(j)      Q 

id 

i) 

(b)  Condition  (2) 


Ep(i)        ©                     Kp(iO 

© 

A                  i  id                   A 
Ep(d)        ©                       Ep(j«) 

4      id 

© 

/A                                   or     /A 

Ep(i')            ©              Ep(i) 

0 

1     ld                      A 

A                                 A 

Ep(j')             ©                Ep(d) 

4      id 

0 

Figure  5»1«  Conditions  for  the  Output  Equivalence 
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Proof  of  Theorem  1: 


(1)  if  part: 

Assume  that  a  statement  S(i)  receives  data  from  statements 
S(0,S(i  ),  . .  .,S(i,  ),  i.e.  for  each  pair  i  and  i  (s=l,2, . .  .,k) 

there  is  an  identifier  id  such  that  (id  ,i  ,i)eDR.  Now  let  E 

— s  — s  s 

be  an  execution  order  which  guarantees  that  (l)  before  S(i)  is 

computed,  all  S(i  )  are  computed,  and  (2)  between  the  computation 
s 

of  S(i  )  and  S(i),  no  statement  updates  id  ,  then  it  is  clear 
s  s 

that  (C(OUT(S(i)))  after  S(i)(P,E)  =  C(OUT(s(i)))  after  S(i)(P,EQ) 
providing  that  all  OUT(S(i  ))  have  appropriate  values.  Note  that 

5 

the  above  two  requirements  are  equivalent  to  conditions  (l)  and 

(2)  of  the  theorem.  Then  by  induction,  we  can  show  that  if 

0 
conditions  (l)  and  (2)  hold  for  all  statements,  then  (P,  E)~(P,  E~). 

(2)  only  if  part: 

We  give  an  example  to  show  that  if  an  execution  order 
violates  condition  (l)  or  (2)  then  we  cannot  get  an  output 
equivalent  execution.   Now  let  P  be 

S(l):   a  :=  b; 

S(2):   c  :=  a; 

S(3):  b  :=  e. 
Then  DR  =  [(b,0,l),(e,0,l),(a,l,2),(c,2,4),(b,3A)},  and  (P,EQ) 

gives 
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C(0UT(S(1)))  after  S(l)(P,EQ)  =  C^b), 

C(0UT(S(2)))  after  S(2)(P,EQ)  =  C^b), 

and  C(0UT(S(3)))  after  S(3)(P,EQ)  =  C^b). 

Now  let  E  (P)  =  {(1,2), (2,1), (3,3)}  which  violates  the  first 

condition  of  the  theorem,  and  Eg(P)  =  { (l,  2),  (2,3),  (3,1))  which 

violates  the  second  condition.   Then 

C(0UT(S(2)))  after  S(2)(P,E1)  =  C  (a) 

and  C(0UT(S(1)))  after  S(l)(P,Eg)  =  C  (e) 

which  do  not  agree  with  corresponding  values  produced  by  (P, E»). 

(Q.E.D.) 


Theorem  1  gives  more  meaningful  executions  compared  to  the  previous 
results  [5l  [10].   For  example  let  P  be: 

S(l):  a:=f1(x) 

8(2):  b:=f2(a) 

8(5):  c:=f3(b) 

S(k):   b:=fu(x) 

S(5):  d:=f  (b,c). 

Fisher  [10],  for  example,  would  give  the  following  execution  (P,E) 
as  an  "equivalent"  execution  to  (P,E0). 


130 


Step 

6)      Q 

1     aJ    |b        E=  ((1'1)'  (2>2)>  (5>5)'  ^'V'    (5A)) 

s  I-  - 

This,  however,  does  not  give  correct  results  unless  P  is  properly 
modified.  Note  that  the  variable  b  carries  two  different  values  between 
steps  2  and  3  which  is  physically  impossible.   Theorem  1  does  not  recognize 
such  an  execution  as  "equivalent"  to  (P,En). 

Theorem  'c 

M 
(P,E)=(P,EQ)  if  and  only  if 

(1)  Vi,  (id,  i,  j)  eDR  implies  that  Ep(i)  <  Ep(j), 

and  (2)  V.,  (id,  i,  j)  eLR  implies  that  E  (i)  <  E  (j). 

J 

Example  2 : 

P:  S(l):  a:=b+c; 

S(2):  d:=a+e; 

S(3):  a:=q+r; 

SCO:  h:=a+s. 
Let  E(P)  =  {(3,1),  (U,2),  (1,3),  (2,i+)}-   Then  (P,E)£(P,EQ). 

E,  however,  violates  the  second  condition  of  Theorem  2,  i.e.  (a, 2, 3)  eLR  but 
Ep(2)  <Ep(3). 

The  following  lemma  is  helpful  to  prove  the  theorem. 
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Lemma  1; 

If  two  conditions  of  Theorem  2  hold  for  an  execution  order  E,  then 

(P,E)-(P,EQ). 

Proof: 

We  show  that  if  conditions  (l)  and  (2)  of  Theorem  2  (we  write  C(2-l) 
and  C(2-2)  for  them)  hold,  then  conditions  (l)  and  (2)  of  Theorem  1  (c(l-l) 
and  C(l-2))  follow. 

First  note  that  C(2-l)  is  identical  to  C(l-l).   Next  we  show  that 
C(2-2)  together  with  C(2-l)  satisfy  C(l-2).  Assume  that  (a, i^ j1), (a,i  , J  ) 

eDR  where  j.  <  ip.   Then  there  exist  statements  S(h,), S(k.), S(h?), S(k?), . . ., 

S(hm),S(km)  such  that  (a,  j^l^),  (a,^,^),  (a,^,^), . . .,  (a,kg,h  ^  . . .,  (a,km,i2) 

eLR  and  (a^,^),  (a,h2,k2),  ...,  (a,h,k  ),...,  (a,hm,km)  eDR.   Then  if  C(2-l)  and 

C(2-2)  hold,  then  E  (j  )  <  E  (i  ).   Thus  C(l-2)  follows. 

(Q.E.D.) 

Proof  of  Theorem  2; 

(1)      if  part: 

Let  E  be  an  execution  order  which  satisfies  C(2-l)  and 

0 
C(2-2).   Then  by  Lemma  1,  (P,E)"(P,EQ).   Now  let  i  be  a  statement 

number  such  that  (id,i,r+l)  eDR  where  r  =  lg(P),  i.e.   S(i)  is 

the  last  statement  in  (P,E-.)  which  updates  id.   Then 

C(id)  after  S(i)(P,EQ)  =  C(id)  after  (P,EQ).      (l) 
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Also  by  a  similar  argument  used  to  prove  Lemma  1,  we  can  show 
that  for  all  J  such  that  id  e  OUT(s(j)),  E  (J)  <  E  (i)  holds. 

Thus  S(i)  is  the  last  statement  to  update  id  in  (P,E),  too.  Thus 

C(id)  after  S(i)(P,E)  =  C(id)  after  (P,E).       (2) 

0 
Also  since  (P,E0)=(P,E),  we  have  C(CUT(S(i)))  after  S(i)(P,E  )  = 

C(OUT(S(i)))  after  S(i)(P,E)  or 

C(id)  after  S(i)(P,EQ)  =  C(id)  after  S(i)(P,E).  (3) 

Thus  from  (l),  (2)  and  (3),  we  have  C(id)  after  (P,E)  =  C(id) 

after  (P,  En).   Using  the  same  argument  for  all  i  such  that 

(id,  i,r+l)  eDR,  we  can  show  that  for  all  m  e  M  ,  C(m)  after 

(P,E)  =  C(m)  after  (P,E  ). 

{?.)      only  if  part: 

This  part  is  again  proved  by  giving  a  counter  example.   It 
is  easy  to  show  that  a  program 


S(l): 

a    :=   e; 

8(2): 

c    :=  b; 

8(3): 

b    :=  a. 

together  with  execution  orders  E,  =  {  (1,3),  (2,1),  (3,2)]  and 
Ep  =  ( (1,1),  (2,3),  (3,2)}  serves  as  a  counter  example.   The 
details  are  omitted. 


(Q.E.D. ) 
Note  that  Lemma  2  can  be  modified  as : 
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Corollary  1; 

If  (P,E)=(P,EQ),  then  (P,E)=(P,EQ) . 

Now  let  us  study  the  memory  equivalence  relation  between  two 
different  programs  and  execution  orders,  (P,,E  )  and  (P  ,E  ),  in  detail.  As 

a  subcase  of  this  let  us  consider  (P  ,E_)  and  (P  ,E  ).   In  general  it  is 

M 
impossible  to  show  whether  (P  ,E  )=(P  ,E  ).   That  is,  this  problem  can  be 

reduced  to  the  Turing  machine  halting  problem  which  is  known  to  be  recursively 

unsolvable  [28].   In  our  discussion  we  have  put  the  restriction  on  P  so  that 

a  program  is  a  loop-free  block  of  assignment  statements.   Even  with  this 

restriction  it  may  still  be  practically  impossible  to  show  memory  equivalence 

between  (P..,En)  and  (P  ,E„).   For  example  if  P  and  P  are  different  polynomial 

approximations  for  the  same  function,  then  (P  ,  E  )  and  (P  ,E  )  are  likely 

to  produce  different  results  due  to  e.g.  a  truncation  error.  We  do  not  pursue 

this  problem  further. 

Finally  let  us  consider  the  following  example: 


Example  3 : 

P:     S(l): 

k 

:=  a; 

S(2): 

b 

:=  k; 

S(3): 

k 

:=  c; 

SflO: 

d 

:  =  k; 

S(5): 

k 

:=  e; 

S(6): 

f 

:=  k. 

Let  E(P)  =  {(1,1),(2,2),(3,2),(U,3),(5,3),(6,U)} 


M 

Then  (P,E0)~(P,E)  and  lg(E) 


k. 


I3h 


However,  the  following  program  P' 
P':  S(l):  k»  :=  a; 
S(2):  b  :=  k*; 
S(3):  k"  :=  c; 


S(k):  d 
S(5):  k 
S(6):  f 


=  k"j 

=  e; 

=  k. 


together  with  an  execution  order  E(P')  =  { (l,l),  (2,2),  (3,1),  (1<.,2),  (5,1),  (6,2)} 
gives  a  memory  equivalent  execution  to  (P,En),  and  lg(E(P'))  =  2. 

This  suggest  the  introduction  of  the  following  transformation,  which 

M 
when  applied  on  a  program  P,  produces  a  new  program  P'  such  that  (P,En)~(P',En) 


Transformation  T 

Let  S  -  (S(i),S(i+l),  ...,S(j)}  and  S2  =  {S(k),  S(k+l), . . .,  S(m))  where 

j  <  k.   Assume  that  there  is  an  identifier  id  such  that  (id,  j,k)  €LP,  and  id 
e  OUT(S(i)),  and  Vu,  i  <  u  <  j,  id  /  OUT(S(u)).  Also  assume  that  for  any  v  and 
w>  i  £  v  £  J  and-  k  <  w  <  m,  there  is  no  id'  such  that  (id',v,  w)  eDP.   Then 
replace  every  occurrence  of  id  in  S,  by  id'  where  id'  ^  M. 

Gold  [17]  presented  a  similar  transformation  to  describe  his  model 
for  linear  programming  optimization  of  programs. 

After  the  transformation  is  applied  S,  and  S  can  oe  processed  in 


M 

parallel,  and  still  (P,E  )~(P',E  )  holds  where  P'  is  the  result^ic  program  a: 

the  application  of  T,  on  P. 

This  shows  that  the  second  condition  of  Theorem  2  is  not  essential, 
i.e.  it  can  be  removed  by  introducing  extra  locations  if  necessnry. 
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6.   PARALLELISM  IN  PROGRAM  LOOPS 

6.1  Introduction 

6.1.1   leplacement  of  a  jfpr.  Statement  with  Many  Statements 

Using  the  results  from  the  previous  chapter,  now  let  us  study  loops 
in  a  program,  e.g.  ALGOL  for  statements  or  FORTRAN  DO  loops,  to  extract  potential 
parallelism  among  statements.   Given  a  loop  P,  we  seek  an  execution  order  E  with 
the  minimum  length  among  all  possible  ones.   Sometimes  it  may  be  appropriate  to 

get  a  loop  P'  from  P  by  the  previously  introduced  transformation  for  which  there 

M 
is  an  execution  order  E'  such  that  (P',E* )=(P,E  )  and  lg(E' )  is  the  minimum 

(For  the  definition  of  =,  see  Chapter  5)- 

As  stated  before,  in  this  chapter  our  main  concern  is  the  parallelism 
among  statements  (inter statement  parallelism).   For  example,  we  are  interested 
in  finding  out  that  all  10  statements  (A[I]  :=  A[  I  +  1]  +  FUNC(B[I]);  (1=1, 2, ..., 
10))  in  Fl  can  be  computed  in  parallel,  whereas  statements  in  F2  cannot  be 
(The  notation  follows  ALGOL  60  [3]): 

Fl:   for  I  :=  1  step  1  until  10  do 

A[I]  :=  A[I  +  1]  +  FUNC(B[I]); 
F2:  for  I  :=  1  step  1  until  10  do 

A[I]  :=  A[I  -  1]  +  FUNC(B[I]). 

First  several  notations  are  presented.  According  to  the  ALGOL  60 
report  [3],  a  for  statement  has  the  following  syntax: 
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<  for  statement  >  ::=  <  for  clause  >  +  <  statement  >  + 

<  for  clause  >  ::=  for  <  variable  >  :»  <  for  Hat  >  do. 

An  instance  of  this  is : 

for  I.  :=  ...  do 

■  m   9 

for  I  :=  ...  do 

begin  EL  ;  S  ;  . . . ;  S  end. 

For  the  sake  of  brevity,  we  shall  write  (i.  *-Ln,  I_  *■  L_,  ....  I  *-  L  )  (S, ,  S^, 

112         2  .    n         n         1'     2 

. ..,    S   )  or  (I  ,    I  ,    ...,    I   )(S. ,    S_,    . ..,    S   )  for  the  above  for  statMMnt 

m      ±       d  n  ±       d  m  ■ 

instance  where  I,  is  called  a  loop  index,  L  is  an  ordered  set  and  called  a  loop 

list  set,  and  S  is  called  a  loop  body  statement  with  a  statement  identification 

number  p  (which  is  different  from  a  statement  number  (see  Chapter  5))- 

As  its  name  suggests,  a  loop  list  set  represents  a  <  for  list  >,  e.g. 
L,  =  (1,2, 3 A,  5,6)  represents  "I.  :=  1  step  1  until  6."  In  general  we  write 

L,  (i)  for  the  i-th  element  of  L  thus  L.  (|L.  |)  is  the  last  element  of  L,  . 

Now  to  facilitate  later  discussions  we  introduce  the  following  notation. 

Let  B  =  (b.,,b  .  ...,b  )  (b.  >  0  for  all  i)  and  (in,i  ,....i  )   be  n-tunles 
1  2     n    i  v  1  2     n' 

# 

of  integers.   Then  we  define  the  value  of  (i.,i0, ...,i  )  w.r.t.  B  as  follows.* 

— — —  id  n 


7r<A>+=<A><A>*  where  *  is  the  Kleene   star. 

JUL 

7nr  F or  convenience  we  write  i(s..t)  for  (t-s+l)  integers  i  ,  i   ,  ..,, 

S      S  ^*  A. 

i+  i)  i+,  e.g.  (i(l..s),  i(s+2..n))  means  (i.,....i  ,i  ^,...,i  ). 

L~-L   b  is  s+2      a 

Also  (|L(s..t)|)  means  ( |Lg | , |Lg+1 | , . . ., |Lt | ).   Finally  (i(n))  means 
n  i's  e.g.  (1(3))  =  (1,1,1). 
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n        n 
V  ((i(i..n))|B)  =   2  i.B  -   IB.  +1 

n+1 

where  B.  =  f  b.  ,.  and  B  =  b  ,,  =  1. 
j   ,  .  k+1     n    n+1 

This  notation  is  introduced  so  that  the  relations 
V((1,1,...,1,D!B)  =  1, 
V((1,1,...,1,2)|b)  =  2, 


V((l,l,...,l,b  )|B)  =  b  , 


n 


V((l,l,...,l,2,l)|B)  =  bn  +  1, 
V((1,1,...,1,2,2)|B)  =  bn  +2, 


hold. 


and  V((b1,b2,...,bn)|B)  =  b^x-.^ 


For  example  V(  (2,3,1)  I  (3A>  5) )  =  31.   An  n -tuple  B  is  called  a  base. 


The  inverse  function  of  V  is  also  defined  as  V~  (t | B)  =  (i(l..n))  if  V( (i(l. .n)) |b) 

=  t.   Note  that  V"1  is  not  one-one  e.g.  V_1(l5|  (J>,k,  5) )  =  (2,0,0)  or  V~1(15|  (3A,  5) ) 
=  (l,i+,0).   An  n-tuple  (i(l..n))  is  said  to  be  normalized  if  b  .  >  i.  >  0  for  all 

J        J 

n        n 
j.   Let  (i(l..n))  be  normalized.   Then  1  <  V( (i( i. .n) ) | B)  <  2  b.B.  -  2  B.+l. 

3-1  J  J   0=1  ° 

n        n  , 

If  1  <  t  <   Z  b.B.  -  ZB.+l,  then  V*  (t | B)  has  unique  normalized  (i(l..n))  as 

d-i  J  J  j=i  J 

its  value. 
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VTe  say  that  normalized  (i(i..n))  ranges  over  B  =  (b(l..n))  in 

n 
increasing  order  if  V(  (i(i.  .n) )  |B)  takes  all  values,  between  1  and  £  b.B.  - 

J=l  J  J 

n 

£  B.+l  in  increasing  order  as  (i(l..n))  changes.  Notationally  we  write 

J=l  J 
(l(n))  <  (i(l..n))  <  (b(l..m)). 

Finally  we  let 

(i(l..n))  >  (j(l..n))  if  V((i(l..n))|B)  >  V( (j(l. .n)) |b) 
and 

(i(l..n))  =  (j(l..n))  if  V((i(l..n))|B)  -  V((j(l..n))|B) 

The  following  lemma  is  an  immediate  consequence  of  the  above 
definition. 


Lemma  1: 


Let  B  =  (b(l..n))  where  V.b.  >  2.   Then 

li  — 

(1)  v((a(l..n))|B)  <  V((a'(l..n))|B)   implies  that 

n 
v((ai-  a1',...,an-  an')|B)  <  -  z  B.   or  v((a1-  e^', . . ,,a  -an' ) |b) 

<  V((0(n))|B). 

(2)  V((a1'+c1',...,an'+cn')|B)  =  V(  (a^,  . . . ,  an+cn)  |  B)  ana 

V((a'(l..n))|B)  >  V((a(l..n))|B)  imply  that  V((c--  C,',,.., 

n 
Cn'Cn')|B)  -  '  .ZBj  or  V((ci-c1'^-^cn-cn')lB)  <  V((0(n))|B). 

(3)  Let  0  <   [a.  |  <  b  for  all  j.   Then  V(  (a(l.  .n)  |b)  <  Vi(0(n) )  |b)  if 
and  only  if  there  is  h  such  that  Vk  (1  <  k  <  h),  a.  =  0  ana  a.  <  0. 
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A  loop  must  be  replaced  with  a  sequence  of  statements  so  that  we  can 
use  the  results  of  the  previous  chapter.   For  example  we  replace 

for  I  :=  1  step  1  until  10  do 

SI:  A[I]  :=  A[I]  +  B[I]; 
with  the  sequence  of  ten  statements 

A[l]  :=  A[l]  +  B[l]; 

A[2]  :=  A[2]  +  B[2]; 

A[10]  :=  A[10]  +  B[10]. 

/"        \ 

In  general  we  will  get  J  it   |L.|-nJ  statements  after  the  replacement 

of  a  loop  P: (I, .  I_, . . , I  )(S  -S^;..:S  ).   Any  statement  in  the  set  of  replaced 

r     1'  2     n   1'  2     m 

statements  can  be  identified  by  an  n-tuple  (i(l..n))  which  corresponds  to  values 
of  I1, I2, ...,In  (i.e.  L1(i1),L2(i2), . .,Ln(in)),  and  p  which  represents  a 

statement  identification  number.   Thus  an  (n  +  l) -tuple  (i(l..n),p)  serves  as  a 

statement  number,  and  we  write  S((i(l..n),p))  to  denote  a  particular  statement 

in  the  set  of  replaced  statements,  e.g.  in  the  above  example  S( (3, l) )  =   A[  3]  := 

A[3l  +  B[31-   The  actual  statement  which  corresponds  to  this  is  the  statement 

S  with  Ln(in ),...,  L  (i  )  substituted  into  every  occurrence  of  I,.....I  in  S  , 
p       ll'nn  1'   '  n     p 

and  we  also  write  S  [L, (i, ),...,  L  (i  )]  for  this. 

p  1  1  '     n  n 

/n     ,  \ 

These   ir  |L.  m  I  statements  are  to  be  executed  according  to  the 

presented  order  (i.e.  the  order  specified  by  for  loop  lists).   In  other  words, 
the  statement  S( (l, 1, . . ., 1, l) )  is  executed  first,  S((l, 1, . . ., 1,2) )  second,  ..., 
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the  statement  S( (i(l. .n),p) )   is  executed  V( (i(l. .n),p) | ( |L(l. .n) | ,m) )-th,  .. ., 
and  the  statement  S((  IL^J,  . . .,  |L    |,m))   is  executed  lastly.     Formally,    as  the 

essential  execution  order  we  have: 

EQ(P)  =   {((i(l..n),p),V((i(l..n),p)|B))    |    (l(n),l)  <  (i(l..n),p)  < 

(|L(l..n)|,m)) 
where  B  =    ( |L(l. .n) |  ,m) . 


Example  1: 


for  I,    :=   1  step  1  until  10  do 


for  I      :=  1  step  1  until  10  do 


begin 

SI:     A1[I,,I0]    :=  A2[I--1,I0]    +B1[I1,I0]; 


■1^2- 


'1 


•l'*2- 


S2:      B^I^Ig-l]    :=  A5[I1+1,I2]    +  B5[I1,  Ig+1] ; 

end 
is  executed  as 

S(  (1,1,1)):     AX[1,1]    :=  A2[0,1]    +B1[1,1]J 


S((l,l,2)) 
S(  (1,2,1)) 
S((l,2,2)) 


B2[1,0]    :  =  A5[2,l]    +  B3[l,2]; 


S((10,10,2)):     B2[10,9]    :=  A3[ll,10]    +B^[10,11]; 
The   superscript   is  used  to  distinguish  different  occurences  of  A  and  B. 
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A[(i(l..n))]  represents  a  form  in  which  L..  (i.  ),...,  L  (i  )  are 
substituted  into  T  ,  ..  .,1  in  index  expressions,  e.g.  in  the  above  example 

A2[(i1,i2)]  =  A2tix-l,i2] 

and  A2r(3,2)]  =  A2[2,2]. 

Finally  a  set  of  inputs  to  a  statement  S(  (i(l.  .n),p))  is  denoted  by 
IN(S((i(l..n),p))).  Similary  OUT(S((i(l. .n),p)))  represents  a  set  of  outputs 
from  S((i(l..n),p)).   From  the  above  example  we  have,  e.g. 

IN(S(l,l,2))  =  (A5[2,1]*B5[1,2]) 

and  0UT(S(1,1,2))   =    {B2[1,0]}. 

6.1.2  A  Restricted  Loop 

In  what  follows,  we  mainly  deal  with  a  restricted  class  of  for 
statements.   Two  restrictions  are  introduced.   Let  a  loop  with  m  body  statements 
be 


P™  =  (l1,I2,...,In)(S1;S2;...;Sm). 


Restriction  1: 


A  for  list  set  L.  must  be  an  arithmetic  sequence,  i.e. 

L  =  (.1,2,3,..., t)  (1) 


for  all  i 


Restriction  2 


Let  (A  ,A  , ...,A  1  be  a  set  of  all  array  identifiers  in  F  where  the 
h-th  occurrence  of  A,  in  P  has  the  following  form  (where  the  superscript  h  is 
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used  only  if  it  is  important  to  distinguish  different  occurrences  of  A,  ): 

A^[F(k,h,l),  F(k,h,2),  ...,  F(k,h,n)].  (2) 

For  fixed  k  and  j,  F(k,h,  j)  has  an  identical  form  for  all  h,  i.e.  either 
F(k,h,j)  =  I.  +  w(k,h,j) 

J 

or  F(k,h, j)  =  0  (i.e.  vacant). 
w(k,h, j)  is  a  constant  number.  Also  we  assume  that  each  A,  appears  on  the  left 

hand  side  of  statements  at  most  once. 

An  example  of  a  restricted  loop  is: 
for  I   :=  1  step  1  until  20  do 

for  I   :=  1  step  1  until  30  do 

for  I,  :=  1  step  1  until  kO   do 

begin 

SI:  A3[  1^1,12+3,0]  :=  AjCIylg-J,)*]  +  ^[0,0,1^]; 

S2:  A2[0,I2,I3-1]  :=  A?[  1^1,12,0]  +  A.^0,0, 1^1] ; 

S3:     A1[0,0,I5+1]    :=  A2[0,I2-1,I3]; 

end; 

Note  that,  for  example,  A,  always  appears  as 

A,[I,  +w(3,h,l),  I  +w(3,h,2),0],  thus  the  first  occurrence  of  A,  is 

A3[F(3,1,1),  F(3,l,2),0]  =  A3[  1^(3,1,1),  I2+w(3,l,2),0]  •  AjCXj-1,  Ig+3, 0]  - 

If  there  is  no  ambiguity,  we  write  e.g.  A,[I, -1, I2+3]  f°r  A,f  I, -1,  I2+3>0] 

(which  is  the  conventional  form). 
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We  also  write  F(k,h, j)(i)  for  the  resultant  expression  obtained  by- 
substituting  i  into  I.  in  F(k,h,j),  e.g.  A,[F(3, 1,  l)(2),  F(3,l,2)(3),0]  = 

A,[  1,6,0]  (=  A,[l,6]  conventionally). 

A  single  variable  may  be  introduced  as  a  special  case  of  array 

indentifiers,  e.g.  we  write 

for  I  :=  1  step  1  until  1  do 
A[I]  := 
for 

T  :=        .... 

6.2  A  Loop  With  a  Single  Body  Statement 
6.2.1  Introduction 

First  we  shall  deal  with  the  case  where  a  loop  has  only  one  body- 
statement  (i.e.  m  =  l).   Let  a  loop  with  a  single  body  statement  be  P  = 
(I  , I  , ...,I  )S.   Since  there  is  only  one  statement  we  may  drop  the  statement 

identification  number.   Then  a  statement  number  for  a  replaced  statement  becomes 
(i(l..n))  and  as  the  essential  execution  order  we  have: 

EqCP1)  =  {((i(l..n)),  V((i(l..n))|(|L(l..n)|))  | 

(l(n))  <  (i(l..n))  <  (|L(l..n)|)}. 
Also  in  this  case  we  only  have  to  consider  the  array  identifier  which 
appears  on  the  left  hand  side  of  S.   Hence  instead  of  s  array  identifiers  we 
only  have  one  array  identifier  (see  Restriction  2  of  Section  6.1.2).   Hence  we 
drop  k  and  write 

Ah[F(h,l),  F(h,2),  ...,  F(h,n)] 
and 
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F(h,j)  =  I  +  w(h,j) 

J 

for  the  h-th  occurrence  of  A  for  Eq.  (2)  of  Section  6.1.2  (the  superscript  is 
used  if  it  is  necessary  to  distinguish  the  different  occurrences  of  A). 
Furthermore  we  assume  that  F(h, j)  ^  0  for  any  h  and  j. 
Now  let  us  study  the  following  two  examples. 
Gl:  for  I  :=  1  step  1  until  10  do 

A[I]  :=  A[I]  +  5; 
G2:   for  I  :=  1  step  1  until  2  do 
for  J  :=  1  step  1  until  10  do 
A[I,J]  :=  A[I-1,J+1]  +  5; 
Assume  that  an  arbitrary  number  of  PE's  are  available.   Then: 
Gl:  All  ten  statements  (A[I]  :-   A[I]  +  5)  can  be  computed  simultaneously 

by  10  PE's. 
G2 :  A[1,J]  and  A[2,J-2]  can  be  computed  simultaneously  by  two  PE's  at  the 
J-th  step  (J=l,2, . ..,10). 

In  what  follows,  the  above  two  types  of  the  interstatement  parallelism 
are  studied. 

Before  we  go  into  the  details,  a  few  comments  are  in  order  with  regard 
to  real  programs.   A  for  statement  with  a  single  body  statement,  (i, ,--.,I  )S, 

can  be  classified  from  several  different  points  of  view.   First  of  all  let  us 
take  a  for  list  set  L..   As  a  simplified  case  we  have  L.  =  (s  .,  s  .+1,  . . .,  t . ) 

(t .  =  (|l.|-1)  +  s.)  which  is  equivalent  to  an  ALGOL  statement  "for  I.  :=  s. 

step  1  until  t.  do".   Knuth  stated  [9]  that  examination  of  published  algorithms 

KJ 

showed  that  well  over  99  percent  of  the  use  of 'the  ALGOL  for  statement,  the 
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value  of  the  step  was  ' +11,  and  in  the  majority  of  the  exceptions  the  step  was 
a  constant.   This  statement  was  confirmed  by  checking  all  Algorithms  published 
in  the  Communications  of  the  ACM  in  1969-   There  were  23  programs  and  263  for 
statements  used.   Only  six  uses  were  exceptions  (z   3  percent). 

Next  let  us  examine  a  body  statement  S.   Then  either  (l)  the  left 
hand  side  variable  of  S  (i.e.  OUT(S))  is  a  single  variable,  or  (2)  OUT(S)  is 
an  array  identifier.   In  case  of  (2)  S  is  of  a  form 

A°[F(0,l),...,F(0,n)]  :=  f  (A^FCl,  l),  .  . .  ,F(l,  n)],  . . .,  AP[  f  (p,  l),  .  . .  ] ) . 

Now  S  has  either  one  of  the  following  five  forms. 

M 

(1)  OUT(S)  is  a  single  variable  t. 

(i)  t  :=  a  function  which  does  not  depend  on  t, 
e.g.  t  :=  a  +  5, 
(ii)  t  :=  f(t),  e.g.  t:=  t  +  a, 

(2)  OUT(S)  is  an  array  variable  A: 

(i)  A  [F  , ...,F  ]  :=  a  function  which  does  not  depend  on  A, 

e.g.  A[I,J]  :=  b  +  5, 
(ii)   for  all  h  F(0,  j)  -  F(h, j)   is  a  constant  for  each  j, 
e.g.  A[I,J]  :=  A[I-5,J+3]  +  A[I+l,J-3]  +  5 
(iii)  other  cases,  e.g.  A[I,J]  :-  A[2I,J-5]  +  a. 
Note  that  if  S  is  of  Form  (l-i),  then 

P1  =  Sfl^ClLj),  L2(|L2|),  ...,  Ln(|Ln|)].   For  example  let  P1  be 


We  use  a  lower  case  letter  for  a  single  variable  and  an  upper  case 
letter  for  an  array  variable. 
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for  I  :=  1  step  1  until  5  do  t  :=  A[I]  -  1. 

Then  after  the  execution  of  r,    t  =  A[  5]  -  1« 

Again  all  Algorithms  published  in  the  CACM  were  checked  (this  time  the 
check  was  made  against  Algorithms  published  in  1968  and  1969. )  There  were  52 
programs  altogether  and  117  for  statements  with  a  single  body  statement.  The 
details  were: 


(1) 


(2) 


No 

.    of  Exam 

pies 

Percentage 

(i) 

0 

0 

(ii) 

1+2 

35.8 

(i) 

18 

15.4 

(ii) 

33 

28.2 

(iii) 

2k 

20.6 

117  100.0 

In  what  follows  we  deal  with  Forms  (2-i),  (2-ii)  and  (2-iii).   Form 
(l-ii)  has  been  discussed  in  Chapter  h. 

6.2.2  Type  1  Parallelism 
6.2.2.1  General  Case 

As  stated  in  Chapter  5>  a  block  of  statements  P  need  not  be  executed 
according  to  the  essential  execution  order  E„  and  may  be  executed  by  any 

execution  order  E  as  long  as  (P, En)=(P, E)  holds.   In  this  section  we  study  a 

special  class  of  execution  orders  called  type  1  execution  orders.   This  execution 
order  is  defined  for  each  loop  index  I  (u=l,2, . . . ,n)  and  hence  there  are  n  of 

these. 
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Definition  1: 


A  type  1  parallel  execution  order  with  respect  to  I  (we  write  1-p 


w.r.t.  I  )  is  given  by 


E(P)  =  {((i(l..n)),V((i(l..u-l),i(u+l..n))|(|L(l..u-l)|, |L(u+l. .n) | )) 
|(l(n))  <  (i(l..n))  <  (|L(l..n)|)), 
and  is  represented  by  E[ I  ] . 

Figures  6.1  and  6.2  illustrate  execution  orders  E~  and  Efl  1. 
D  0      L  uJ 

Note  that  Efl  ]((i(l..n)))  =  E[I  ] ( (if (l. .n) ) )  if  i  =  i*   for  all 

U  li  K.  K. 

k  =  1,2, . . . ,  u-l,u+l, . . . ,n.   Furthermore  note  that  if 

V((i(l..u-l))|(|L(l..u-l)|))  >  V((i'(l..u-l))|(|L(l..u-l)|)), 
then  E[Iu]((i(l..n)))  >  E[ y ( (i» (1. .n) ) ) . 

By  introducing  extra  |L  |  PE's,  the  computation  time  becomes  one 

n  n 


|L  I -th  of  the  original,  i.e.  tt  |L.|  steps  instead  of  it   |L.|  steps,  where  one 
U  j=l  J  j=l  3 

step  corresponds  to  the  computation  of  a  body  statement. 

We  now  introduce  TRANQUIL  notation  [2]  to  illustrate  Definition  1.   In 
TRANQUIL 

for  (I)  sec}  (L)  do  S 
st^oids  for 

for  I  :=  (for  list  set)  do  S. 

Also  in  TRANQUIL  for  (i)  sim  (L)  do  S  indicates  that  statements 
S(L(i))  are  executed  simultaneously  for  all  L(i)  in  L.  Then  Definition  1 
amounts  to  obtaining 
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for  (I  )  se£  (L1)  do 


for  (I  ,  )  sea  ( L  .  )  do 

u-1  — ■*   u-1  — 


for  (I  )  sim  (L  )  do 


for  (Iu+1)  seq,  (Lu+1)  do 


for  (I  )  seq  (L  )  do  S 

v  n  — ■*   n  — 


for  (I  )  se£  (iO  do 


for  (I  )  seq  (L  )  do 

u  — ■*   u  — 


for  (I  )  seq  (L  )  do  S. 

n  — ■*   n  — 

First  we  study  a  type  1  parallel  execution  order  for  a  general  loop 

in  detail.   Let  the  two-dimensional  plane  I  -  I  .  x  •••  X  I  be  an  |L   by 

u    u+1         n       '  u ' 

n 

7T  |L.|  grid  (see  Figure  6.3)« 
j=u+l  D 

The  grid  is  labeled  by  1,2,  ...,i  ,  ...etc.  rather  than  by  L  (l),L  (2), 
...,L  (i  ),...  etc.  for  convenience. 

Note  that  each  square  of  the  plane  represents  the  computation 
S((i(l. .u-l),i(u. .n)))  for  some  (i(l..u-l)).   If  P'  is  executed  by  E  ,  then  the 
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<**♦! 


u\J> 


Figure  6.3*  Conditions  of  Parallel  Computation  in  a  Loop 


computation  proceeds  from  the  leftmost  column  to  the  rightmost  column  while  in  a 
column  the  computation  proceeds  from  the  top  to  the  bottom  sequentially.   On  the 

other  hand  if  P  is  executed  by  E[ I  ]  then  we  proceed  to  compute  from  the  top 

row  to  the  bottom  row  while  we  perform  computation  in  each  row  simultaneously. 
Each  computation  S( (i(l. .u-l), i(u. .n) ) )  uses  inputs  IN(S( (i(l. .n)))  and  updates 
the  output  0UT(S((i(l..n)))).   Then  as  we  studied  in  Chapter  5,  we  have  to  make 
sure  that  the  computation  S((i(l..n)))  (marked  x  in  Figure  6.3)  does  not  receive 
any  data  which  are  to  be  updated  by  the  computation  in  the  region  R,  i.e.  there 
must  be  no  id  such  that 

(id,  (i(l..u-l),i'(u..n)), (i(l..n)))  cDR 
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holds  where  (i'(u..n))  >  (i(u..n))  and  1^  <  i^. 

Similarly  the  computation  S((i(l..n)})  must  not  use  any  data  which  are  to  be 
updated  by  the  computation  in  the  region  Q,  i.e.  there  must  be  no  id'  such  that 

(id\  (i(l..n)),(i(l..u-l),i"(u..n)))  €LR 
holds  where  (i"(u..n))  <  (i(u..n))  and  iu"  >  iu« 
The  above  observation  gives  the  following  theorem. 

Theorem  1: 

Let  E[I  ]  be  a  type  1  parallel  execution  order  w.r.t.  I  .   Then 

M 
(P^Efl  ])=(P1,EQ)  if  and  only  if  there  are  no  id,  id',  (1(1.  .n)),  (i(l..u-l), 

J 

i'(u..n))  and  (i(l. .u-l), i"(u. .n) )  for  which  either 

(1)   (i)   i  '  <  i  and 
'        u     u 

(ii)   (i'(u+l..n))  >  (i(u+l..n))  and 

(iii)   (id,  (i(l. .u-l),i' (u. .n)), (i(l. .n)))  eDR,  where  id  e 

OUT(S((i(l..u-l),i"(u..n))))  and  id  €  IN(S( (i(l. .n) ) ) 

or  (2)   (i)   i  "  >  i  and 
u     u 

(ii)   (i"(u+l..n))  <  (i(u+l..n))  and 

(iii)   (id*,  (i(l..n)), (i(l.. u-l),  i"(u..n)))  eLR  where  id'e 

OUT(S((i(i..u-l),i"(u..n))))  and  id  e  IN(S((i(l. .n))) 

hold. 


T(±' (u+1. .n))  ^  (i(u+l..n)),  for  example,  means  V( (i1 (u+1. .n) ) | B)  > 
V((i(u+1. .n))|B)  where  B  =  ( |L(u+l. .n) | ) .   Unless  specified,  the 
base  ( | L(s. .n) | )  (=( |L  |,...,|L  |))  is  to  be  understood  for  (i(s. .n)) 
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Let  S  be  of  a  form  A  :=  f(A  ,  ...,AP)  where  A  is  an  array  identifier 
and  the  superscript  is  used  to  distinguish  different  occurrences  of  A.  Then 
id  in  the  first  condition  of  Theorem  1  corresponds  to  those  A  [(i(l..n))]  for 

which  A  [(i(l..u-l),i'(u..n))]  =  A  [(i(l..n))]  holds  together  with  the  three 
conditions  (i),  (ii),  and  (iii)  of  (1)  between  (i'(u..n))  and  (i(u..n)). 

Similarly  id'  corresponds  to  those  A  [(i(l..n))]  for  which  A  [(i(l..u-l), 

i' (u. .n))]  =  A  [ (i(l..n))]  holds.  Thus  A  (l  <  h  <  p)  can  be  classified  into 
three  groups : 

CI  =  {h|A  satisfies  the  first  condition} 

C2  =  {h|A  satisfies  the  second  condition) 
C3  =  (1,2,  ...,p)  -  CI  -  C2. 

Note  that  CI  n  C2  =  0. 


Example  2 : 

Let  P1  be 

(Ix-  (1,2,3),  I2-  (1,2,3))(A°[I1,I2]  :=  f(k\ I^Ig+l])). 

Then  for  i^  =  1<  ig  =  2  and  (i2')  =  (3)  >  (ig)  =  (2),  we  have  A°[  (i^i^  )]  = 

A  [(i  ,i  )]  =  A[l,3],  or  (A[l,3],  (1,3),  (2,2))  eDR.  Thus  P  cannot  be  computed 

in  1-p  w.r.t.  I,,  and  CI  =  [1]  . 

From  this  argument  it  should  be  clear  that  if  a  body  statement  is  of 
Form  (2-i)  (see  Section  6.2.1),  then  the  loop  can  be  computed  in  1-p  w.r.t. 
any  Iu(u=l,2,  ...,n). 
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6.2.2.2  A  Restricted  Loop 

If  a  loop  is  a  restricted  loop,  then  Theorem  1  may  be  simplified. 
First  we  define  a  vector  R(h)  for  each  h  =  1,2, .  ..,p: 
R(h)  =  (R1(h),...,Rn(h)), 

where 

R  (h)  =  F(0,j)  -  F(h,j) 

J 

=  I,  +  w(0,j)  -  (I   +  w(h,j)) 

J  J 

=  w(0,j)  -  w(h,j). 
For  example  we  get  R(l)  =  (-1,8)  from  a  statement 

A°[  Irl,I2+3]    :=   f(AX[  I^Ig-5]). 

Then  we  use  these  vectors  to  check  parallel  computability  as  follows. 
Also  for  convenience  we  write 

R'(u,h)  =  (R1(h),...,Ru_1(h)) 

and 

R"(u,h)  =  (Ru+1(h),...,Rn(h)). 

Theorem  2: 

If  one  of  the  following  two  holds  for  any  of  R(h)  (h  =  1,2, ...,p), 

then  P  cannot  he  computed  in  1-p  w.r.t.  I  . 

(1)   (i)  R'(u,h)  =  (0,...,0)  and 

(ii)  R  (h)  >  0  and 
u 

(iii)  R"(u,h)  <  (0,  ...,0)  and  V.(u  +  1  <  j  <  n)  |  R.(h)|<|L.|  -  1. 
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(2)   (i)  R*(u,h)  =  (0, ...,0)  and 
(ii)  R  (h)  <  0  and 

(iii)  R"(u,h)  >  (0,...,0)  and  V  (u  +  1  <  j  <  n)  |  R  (h)|<|L.|  -  1. 

That  the  theorem  is  valid  is  the  direct  consequence  of  Theorem  1,  i.e. 
the  first  check  of  the  theorem  corresponds  to  the  first  condition  of  Theorem  1 
and  the  second  check  corresponds  to  the  second  condition.  For  example  the 
first  condition  of  Theorem  1  says  that  if 

(id,  (i(l..u-l),  i'(u..n)), (i(l..n)))  eDR  holds  f or  i  '  <  i  and 

(i'(u+l..n))  >  (i(u+l..n)),  then  P  cannot  be  computed  in  1-p  w.r.t.  I  ,  where 


u* 


id  e  0UT(S((i(l..u-l),  i'(u..n))) 
and  id  e  IN(S( (i(l. .n) ) ) ) . 
Then  id  represents  the  element  of  A  for  which 

Ah[(i(l..n))]  =  A°[(i(l..u-1),  i'(u..n))] 
holds.   Now  this  implies  that 

F(h,j)(L.(i.))  =  F(0,J)(L  (i  .)) 

J     J  J     J 

for  j  <  u  and 

F(h,d)(L,(i,))  =  F(0,j)(L.(i  ')) 


for  j  >  n.   Hence 


L  (i  )  +  w(h,j)  =  L  (i  )  +  w(0,j) 

J     J  J     J 


or  R.(h)  =  0  for  j  <  u,  and 
J 


or 


L  (i  )  +  w(h,j)  =  L.(i')  +  w(0,d) 

J     J  J     J 


i.'  =  i.  -  R.(h) 

3  J     d 
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for  J  >  u.  Also 

(i'(u+l..n))  >  (i(u+l..n)) 
with 

B  =  (|L(u+l..n)|) 
becomes 

V((iu+1-Ru+1(h),...,in-Rn(h))|B)  >  V((i(u+l..n))|B). 

Then  by  Lemma  1, 

V((Ru+1(h),...,Rn(h))|B)  <  V((0, ...,0)|b). 

Thus  the  first  check  of  Theorem  2  is  verified.   The  second  check  can  be 
varified  similarly. 

Now  let  us  consider  the  number  of  checks  required.  For  each  A  (h=l,  2, 
. ..,p)  which  appears  on  the  right  hand  side  of  S,  we  first  obtain  a  vector  R(h). 
Then  for  each  loop  index  I  ,  we  perform  the  two  checks  given  by  Theorem  2  for 

all  R(h)  (h=l, 2, . . .,p).   Since  there  are  n  loop  indicies,  in  total  we  perform 
2np  checks . 

The  procedure  described  in  this  section  can  be  extended  to  cover 
nonrestricted  loops,  too.   Let  S  be  of  a  form 

A°[F(0,l),...,F(0,n)]  :=  f  (aV(1,  l),  .  . .  ,F(l,n)] ,  .  . . ,  AP[  (F(p,  l),  .  .  .  ]  ) 

and  we  define  a  vector  R(h)  for  each  h  =  1,2, •  ..,p  as  we  did  before,  i.e. 
R(h)  =  (R1(h),  ...,  Rn(h)) 

and 

R  (h)  =  F(0,j)  -  F(h,j). 

J 

Since  a  loop  is  not  restricted,  F(0, j)  and  F(h, j)  may  take  any  form  and  hence  R ^h) 

d 
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may  not  be  a  constant  number  but  rather  a  function  of  loop  indicies,  e.g.  R(h)  = 

I,  +  21,  -  5.  Hence,  in  the  most  general  case,  it  is  necessary  to  check  the 

two  conditions  of  Theorem  2  for  all  values  of  (i(l..n))  (i.e.  (l(n))  <  (i(l..n)) 
<  ( lL-1,  |L  |,  • ..,  |L  | ))  to  examine  type  1  parallel  comput ability,  i.e. 

n 
2(  it   |L.|)  checks  are  required  for  each  R(h)  (h=l, 2, . .  .,p).  In  many  cases,  we 

d=i  J 

can  expect  that  the  number  of  checks  required  is  far  smaller  than  that.   For 
example  if 

R(D  =  (21^21^,1^21^), 
then  only  2( | L, |x|L, | )  checks  are  required,  i.e.  it  is  not  necessary  to  check 
for  those  loop  indicies,  e.g.  1^,   which  do  not  appear  in  R.(l)  (d"l,  2,3**0. 

6.2.2.3  Temporary  Locations 

In  this  section  we  mean  a  restricted  loop  by  a  loop.   The  second 
condition  of  Theorem  1  (or  2)  may  be  dropped  by  introducing  extra  temporary 

locations  by  applying  Transformation  T  of  Chapter  5  on  P  ,  i.e.  if  CI  =  0  and 

and  C2  /  0,  then  temporary  locations  may  be  set  up  so  that  P  can  be  computed 
in  parallel  (for  CI  and  C2,  see  Section  6.2.2.1).   Let  heC2.   This  implies  that 
there  are  (i(l..n)),  (i(l. .u-1), iT (u. .n) )  and  id  (see  Figure  6.^)  for  which 

(id,  (i(l..n)),(i(l..u-l),i'(u..n)))  eLR 
holds  and 

id  =  Ah[(i(l..n))]  €  IN(S((i(l..n)))) 
and 
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id  =  A°[(i(l..u-l),i'(u..n))]  e  OUT(S( (i(l. .u-l), i ' (u. .n)) )) ) . 
If  a  loop  is  confuted  in  1-p  w.r.t.  I  ,  then  we  have 

E[Iu]((i(l..u-l),i'(u..n)))  <  E[Iu]((i(l..n))) 

while 

E0((i(l..u-l),i'(u..n)))  >  E0((i(l..n))). 

Hence  A  [(i(l..n))]  will  be  updated  "by  S(  (i(l.  .u-l),  i' (u.  .n) ) )  before  being 
used  by  S( (i(l. .n)) ).   That  is,  if  we  compute  P  in  1-p  w.r.t.  I  we  must  keep 

the  old  value  of  A  [(i(l..n))]  which  otherwise  will  be  updated  by  S( (i(l. .u-l), 
i'(u..n)))  at  the  E[I  ] ((i(l. .u-l), i* (u. .n) ) )-th  step  separately  until  it  is 

used  by  S((i(l..n)))  at  the  E[ I  ] ( (i(l. .n)))-th  step.   The  period  of  time,  t  , 

through  which  the  old  value  of  A  [(i(l..n))]  must  be  kept  for  the  computation 
S((i(l..n)))  is  given  by 

th  =  E[Iu]((i(l..n)))  -  E[Iu]((i(l..u-l),i'(u..n))) 

=  V((i(u+l..n))|B)  -  V((i'(u+l..n))|B), 
where  B  =   |L(u+l. .n) | ).   Then  as  we  showed  in  Section  6.2.2.2,  in  case  of  a 
restricted  loop,  we  can  show  that 

n 

t     =  V(R"(u,h)        B)   +       Z    B     -  1 
h  ,  t    s 

s=u+l 

n 

where  B     =     ir   I L ,    .  I  .      The  details   are  omitted. 
s  '    t+1 ' 

t=s 
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Figure  G.k.      An  Illustration  of  t 


Now  max  t  gives  the  maximum  period  of  time  through  which  A  [(i(l..n))]  must  be 
h€C2 

kept.   Since  we  have  | L  |  of  them  (i.e.  |L  |  statements  are  computed 


simultaneously),  the  total  amount  of  temporary  locations  required  will  he  |L  |  x 


u' 


max  t,  .  Additional  |L  |  locations  are  required  for  buffering  (see  Example  5)« 


u 


Hence  we  have  the  following  theorem. 


Theorem  3: 


The  maximum  number  of  temporary  storage  locations  required  is 


L  |  x  (max  [V((R    (h),...,R  (h))|B)  +   Z     B  ] ) 
heC2      U+1        n         s=u+l  S 


where  B  =  ( |L(u+l. .n) | )  and  B  =  w   |L ,  ,  |  and  B  =  1. 

t=s 
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Example  3: 

Let  P1  be 

for   (I   )    se£  (1,2,  ...,U0)   do 

for   (I    )    se£  (1,2,  ...,1j-0)    do 

A[IX,I2]    :=  A[I1+2,I2-3]    +  2; 

P  as  it  is  cannot  "be  computed  in  1-p  w.r.t.  I  because  it  violates 

the  second  condition  of  Theorem  2.   Now  we  modify  P  as  follows  by  introducing 
temporary  arrays  Tl(UOxl)  and  T2 (1*0x3). 
for  (I  )  se£  (1,2,. ..,^0)  do 

for  (I  )  se£  (1,2,  ,..,kQ)   do 

begin  SI:      T1[I1]    :=  Afl^Ig] ; 

S2:     k[lvl2]    :=  T2[I]_,I2  mod  3]    +2;^ 

S3:      T2[I1,I2  mod  3]    :=   Tl[  Ij  ; 

end. 

Then  all  three  statements  can  be  computed  in  1-p  w.r.t.  I  ,  i.e.  we 

can  replace  seq  in  the  first  for  statement  by  sim.   The  original  P  ,  if 

executed  sequentially,  takes  1600  steps  whereas  the  modified  P  takes  only  120 
steps  if  executed  in  parallel  with  respect  to  I  . 


JL 

"a  mod  b  =  a 


Also  we  assume  that  T2  is  properly  initialized  before  the  computation 
of  the  loop,  i.e.  store  A[l,*],  A[2,*]  and  A[3,*]  in  T2[l,*],  T[2,*] 
and  T[0,  *]. 
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6.2.3  Type  2  Parallelism 

In  this  section  we  mean  a  restricted  loop  by  a  loop.  Since  the 
conflict  between  two  statements  S(i)  and  S(j)  due  to  the  existence  of  an 
identifier  id  such  that  (id,  i,  j)  eLR  may  be  resolved  by  introducing  temporary 
locations  (see  Chapter  5  and  the  previous  section),  such  conflict  will  not  be 
taken  into  account  to  check  parallel  computability  throughout  the  rest  of  this 
chapter. 

This  section  describes  the  second  type  of  parallelism,  i.e.  type  2 
parallelism,  in  a  for  statement  with  a  single  body  statement.   Type  2 
parallelism  is  introduced  to  resolve  the  conflict  due  to  the  first  condition 
of  Theorem  1.   The  following  example  illustrates  it. 

Example  k: 

P:   for  I   :=  1  step  1  until  kO   do 

for  I0  :=  1  step  1  until  kO   do 

A°[I1,I2]  :=  a\  1^1,12+1]  +  A2[I1,I2-1]; 

Since  R1(l)  a  1  >  0  and  (Rp(l))  =  (-1)  <  (0)  hold,  P  cannot  be 
computed  in  1-p  w.r.t.  I  . 

Now  let  us  consider  the  I  -I  plane  (Figure  6.5).   Suppose  that  all 

S((i, ,i  ))  in  the  shaded  area  have  been  computed.   Then  at  the  next  step  those 

S((i  * ,  i  ' ) )  marked  as  HJ  can  be  computed  simultaneously,  and  at  the  following 

step  all  (2)  can  be  computed  simultaneously,  and  so  forth.   We  can  see  that  a 
heavy  zigzag  line  travels  from  left  to  right  like  a  "wave  front"  indicating 
that  all  statements  on  that  front  can  be  computed  simultaneously. 
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Figure  6.5-  Wave  Front 

Note  that  computation  of  P  by  this  scheme  takes  approximately  120 
steps,  while  if  P  is  computed  sequentially  it  takes  k-0   x  h-0   =  1600  steps. 

Given  a  loop  P  ,  if  P  is  computed  in  1-p  w.r.t.  I  ,  then  a  "wave 

front"  is  in  parallel  with  the  I  axis  of  the  I  -  I  ,  x  •••  X  I  plane,  and 
^  u  u    u+1         n  *         ' 

it  travels  in  the  increasing  order  of  (i(u+l..n)). 

If  P  cannot  be  computed  in  1-p  w.r.t.  I  then  it  may  be  possible  to 
find  a  "wave  front"  which  is  diagonal  rather  than  horizontal  as  in  Example  k   on 


the  I  -  I  -  x  ...  x  I  plane, 
u    u+1         n  ^ 
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The  direction  of  wave 
front  travel 


tan  a  =   slope  of  a  wave  front 


Figure  6.6.  Wave  Front  Travel 


This  wave  front  is  such  that  all  computations  S((i(l..n)))  which  corresponds  to 
points  (i  ,i  ,,...,  i  )  which  lie  right  next  to  a  wave  front  can  be  computed 

simultaneously.   In  other  words  all  necessary  data  to  compute  S( (i(l. .u-l), 
i(u..n)))  have  been  already  computed  in  the  shaded  area.   The  direction  of  a 
wave  front's  travel  is  perpendicular  to  the  wave  front. 

Now  let  us  obtain  the  slope  of  a  possible  wave  front  for  a  restricted 
loop. 

Let  P  be  a  restricted  loop.   Assume  that  P  cannot  be  computed  in 
1-p  w.r.t.  I  .   Then  according  to  Theorem  2,    this  means  that  there  are  R(h)  for 

which  R  (h)  >  0  and  (R   (h),  . .  ,,R  (h))  <  (0, ...,0)  hold  (i.e.  CI  /  0). 
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Theorem  k: 

The  slope  of  a  possible  wave  front  in  the  I  -  I  ,  x  •••  x  I  plane 


is  given  by 


max 
he  CI 


^  ((-V((Ru+1(h),...,Rn(h))|B)  -   r  Bs  +2)1 
u  s=u+l       -1 


n 

where  B  =  (  lL(u+l.  .n)  |  )  and  B  =  7r  L,  ,.   and  B  =  1. 

s       t+1       n 
t=s 


In  Example  k,    the  slope  of  the  wave  front  is  ■r(-("l"l+l)-l+2)  *  2. 

Proof  of  Theorem  k: 

Let  us  consider  S( (i(l. .u-l), l(u. .n) ) )  on  the  I  -  I    x  ...  x  I 

u    u+l         n 

plane.   Assume  that  there  is  a  variable  id  such  that  (Figure  6.7) 

(id, (i(l..u-l),i'(u..n)),(i(l..u-l),i(u..n)))  eDR 

holds  together  with 

i  '  <  i  and  (i"(u+l..n))  >  (i(u+l..n)), 
u     u  — 

i.e. 

id  e  IN(S((i(l..u-l),i(u..n)))) 
and 

id  e  OUT(S((i(l..u-l),i'(u..n)))). 
This  implies  that  there  is  he  CI  such  that 

Ah[(i(l..u-l),i(u..n))]  =  A°[(i(l..u-l),i'(u..n))] 
holds. 


l£k 


(i(u+l..n)) 


(i'(u+l..n)) 


1      1                T 

<    "T" 

t 

/ 

h 

// 

/ 

7\a 

,  \L. 

_x 

k Ru(h)— ■* 

Figure  6.7.  An  Illustration  for  Theorem  k 


In  case  of  a  restricted  loop,  we  have 


i  '  =  1 


lUh) 


for  j  >  u.   Now  let 

t,  =  V((i'(u+l..n))|B)  -  V((i(u+l..n))|B) 


then  we  get 


n 


t.  =  -V((R   (h),...,R  (h))|B)  -   Z  B  +  1  where  Be 
n        u+1         n  .  s  s 

s=u+l 


n 


7T  |L,   I  and 


t=s 


t+1 


B  -  1.   Now  if  we  let  the  slope  of  a  wave  front  be  equal  to 
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t  +  l    t,  +  1 

h        h 

i  -  i  '  "  R  (h) 

u    u     u 

then  A  [ (i(l. .n))]  and  A  [ (i(l. .U-l), i' (u. .n))]  will  be  separated  by  it 
(Figure  6.7). 

The  actual  wave  front  is  a  zigzag  line,  rather  than  a  straight  line 
as  shown  in  Figure  6.7- 

If  there  are  more  than  one  h  in  CI,  then  we  choose  a  to  be  large 

enough  so  that  all  inputs  to  S( (i(l. .u-1),  i(u. .n)) )  be  inside  of  a  wave  front, 

i.e. 

t  +  1 
tan  3  =  max  ^-^p. 
h£Cl  u 

(Q.E.D.) 

Now  suppose  we  compute  F  in  parallel  w.r.t.  I  using  a  diagonal 
wave  front  whose  slope  is  D  =  tan  q>   Then  how  many  steps  (one  step  corresponds 
to  the  computation  of  a  body  statement  S)  does  it  take  to  compute  P  ? 

Theorem  c/\ 

The  total  number  of  steps  required  to  compute  P  in  parallel  w.r.t. 
I  using  a  diagonal  wave  front  whose  slope  is  D  is  given  by 


/u-1         \ 

i     n 

|L    |D^ 

1    u ' 

i 

T     - 
P 

'     TT     |L.| 

'.j=u+l     J 

+ 
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Proof: 


Let  us  consider  the  I  -  I  ,,  x  ...  x  I  plane. 

u    u+i         n 


end 


Figure  6.8.   An  Execution  by  a  Wave  Front 


Wave  front  W  must  travel  from  the  start  position  to  the  end  position  on  the 

n 
plane.   How  long  does  it  take?  It  takes  L  +  |L  |d  steps  where  L  =  it     |L. 

U  j=u+l  3 

u-1 

Since  we  have  to  process  7rlL.ll  -  I  ,-■  x  •••  X  I  planes,  in  total  it 

^       ._,  '  j  '   u    u+1         n  *  ' 

u-1 
becomes  T  =  tt  |L .  I  (L  +  |L  |D). 


(Q.E.D.) 


Note  that  if  a  wave  front  is  horizontal  (i.e.  if  P  can  be  computed 


in  1-p  w.r.t.  I  ),  then  D  =  0  and  T  =  tt  IL.I. 
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6.2.4  Conclusion 

Assume  that  there  are  an  arbitrary  number  of  PE's  available.   Given 
a  restricted  single  body  statement  loop, 

P1  =  (lr...,In)(A0rF°,...,^]  :=  f(A1[F^,...,Fj],...,AP[FP,...>^])) 

we  can  check  if  F  can  be  computed  in  1-p  w.r.t.  I  (u=l, ...,n)  by  Theorem  2. 

If  it  cannot  be,  then  we  can  check  for  type  2  parallel  computability  w.r.t. 

I  ,  i.e.  find  a  possible  wave  front.  In  either  case  we  obtain  the  number  of 

u' 

n  u-1       n 

computational  steps  required,  i.e.  T  =  ir  |L.|  or  T  =  (  tt  |L.|)(  ir   |L.|  + 

U   j=l  3  u    j=l  3       j=u+l  J 

lL  I'D),  where  one  step  corresponds  to  the  computation  of  the  body  statement  S. 

Then  among  all  possible  choices,  we  would  choose  to  compute  in  parallel  w.r.t. 
I  where  T  =  min  T . - 

6.3  A  Loop  With  Many  Body  Statements 

6.3-1  Introduction 

In  what  follows  we  mean  a  restricted  loop  by  a  loop.  Again  a  check 
against  all  published  Algorithms  in  1968  and  1969  CACM  issues  has  been  done, 
and  it  has  been  revealed  that  well  over  50  percent  of  the  cases  of  for  statement 
usage  (with  more  than  two-body  statements)  are  instances  of  restricted  loops. 

Also  as  stated  in  Section  6.2.2.3  and  Chapter  5,  the  LR  relation  may 
be  disregarded  by  introducing  temporary  locations.  Hence  it  will  not  be  taken 
into  account  throughout  the  rest  of  this  chapter. 
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Given  a  loop  with  m  body  statements,  P  ,  there  are  three  different 
approaches  to  compute  it  in  parallel.  First  it  is  possible  to  extend  the 
procedure  described  in  Section  6.2  by  treating  m  body  statements  as  if  they 
were  one  statement.  That  is,  we  consider  body  statements  as  a  function 

OUT(S(i(l..n)))  =  f(lN(S((i(l..n)))). 
For  example 

SI:     A[I,J]    :=   f(A[I,J-l],B[I-l,J-l]); 
S2:      B[I,J]    :=   g(A[I-l,J-l],B[I,J-l]) 
yield 

S:      {A[I,J],B[I,J]}    :=   f(A[I,J-l],A[I-l,J-l],B[I-l,J-l],B[I,J-l]). 
Then  we   can  apply  Theorem  1  directly  to  check  if  e.g.    S  can  be  computed  in   1-p 
w.r.t.    I. 

The  second  and  the  third  approaches  can  be  illustrated  by  the  follow- 
ing two  examples. 

El:      for  I    :=   1   step  1  until  kO  do 
begin 

SI:     A[I]    :=   f(A[I],B[I]); 
S2:      B[I]    :=  g(A[  I],B[  1-1] ); 
end; 
E2:      for   I    :=   1   step  1  until  i+0  do 
begin 

SI:      A[I]    :=   f(A[I-l],B[I-2]); 
S2:      B[I]    :=  g(A[I]); 
end. 
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In  El,  note  that  SI  and  S2  cannot  be  computed  in  parallel  for  all  values  of  I 
because  S2  has  an  iteration  form 

B[I]  :=  g'(B[I-l]). 
However,  El  may  be  replaced  with  two  for  statements : 

for  I  :=  1  step  1  until  kO   do 
SI:  A[I]  :=  f(A[I],B[I]); 

for  I  :=  1  step  1  until  kO   do 

S2:   B[I]  :=  g(A[  I],  B[  1-1] ) ; 
Now  the  first  loop  can  be  computed  in  parallel  for  all  values  of  I  while  the 
second  for  statement  is  still  an  iteration.   In  general  by  replacing  a  single 
or  statement  with  two  or  more  for  statements  the  parallel  part  may  be  exposed. 

In  the  second  example,  SI  and  S2  can  not  be  computed  in  parallel 
for  all  values  of  I,  nor  can  they  be  separated  into  two  independent  loops 
because  SI  uses  values  which  are  updated  by  S2  (i.e.  B[I-2]),  and  S2  uses 
values  being  updated  by  SI  (i.e.  A[I]).   However  SI  and  S2  could  be  computed 

simultaneously  while  I  varies  sequentially  if  the  index  expression  in  S2  is 

A 

"skewed"  as  follows. 

E2'  for  I  :=  1  step  1  until  kO   do 
begin 

SI:      Ari]    :=   f(A[I-l],B[I-2]); 
S2»:      B[I-1]    :=   g(A[I-l])j 
end. 


JL 

Strictly  speaking  S2'    should  not  be   executed  when  1=1  and  an  extra 

statement  S2" :      Bl^O]    :=  g(  k[kO])   is   required  after  this  loop.      For 
the  sake  of  brevity  those  minor  boundary  effects  are  ignored  through- 
out this  section. 
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Figure  6.9  illustrates  the  computation  of  the  modified  loop  as  well  as  the 
original  loop. 


I 

SI 

S2 

• 
* 

i-2 

B[l-2] 

1-1 

A[ 1-1] 

1      / 

/ 

i 

1  / 
A[i]- 

-»B[i] 

i-2 


1-1 


SI 


A[i-1] 
^ 


A[i] 


S2' 


^ 


B[i-2] 


B[i-1] 


Figure  6.9.   Simultaneous  Execution  of  Body  Statements 

In  general,  the  above  three  approaches  could  be  tried  in  any 
combination.   For  example,  we  may  first  try  the  first  approach,  i.e.  we  try  to 
execute  body  statements  simultaneously  for  all  values  of  some  loop  index.   If 
this  Tails,  then  we  may  use  the  second  approach,  i.e.  we  separate  a  loop  or 
we  replace  a  loop  with  as  many  for  statements  as  possible.   On  a  resultant 
for  statement  we  again  try  the  first  approach  (if  it  has  only  one  body  statement, 
then  the  results  of  the  previous  section  can  be  used).   If  we  fail  again,  then 
the  third  approach  can  be  taken. 

We  now  describe  each  approach  separately. 

Before  we  go  further,  we  define  the  following  notations.  Without 

loss  of  generality  we  assume  that  the  p-th  occurrence  of  A,  appears  in  S  and 

k  p 


also  assume  that  S  and  S  have  forms 
P     q 
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and 


S  *   (AP[F(k,p,l),...,F(k,p,n)]  :=  f _.(...)) 


S  =  (-..  :=  f(...,A*[F(k,q,l),...,F(k,q,n)], ...))• 


q        q 

Then  we  define  a  vector  R(k,p,q)  as  follows 

R(k,p,q)  =  (R1(k,p,q),...,Rn(k,p,q)) 

where 

R,(k,p,q)  -  F(k,p,i)  -  F(k,q,j). 

J 

-  w(k,p,j)  -  w(k,q,j). 
If  F(k,p,ci)  =  F(k,  q,  j)  =  0,  then  we  let  R.(k,p,  q)  =  0.   Finally  we  write 

J 

R'(u,k,p,q)  =  (Rx(k,p,q),  ...,Ru_1(k,p,q)) 

and 

R"(u,k,p,q)  =  (Ru(k,p,q),...,Rn(k,p,q)). 

6.3*2  Parallel  Computation  with  Respect  to  a  Loop  Index 

We  first  study  the  first  approach  described  in  the  previous  section, 
i.e.  we  treat  body  statements  as  if  they  were  one  statement  and  try  to  execute 
them  in  parallel  with  respect  to  some  loop  index. 

Let  us  consider  P  =  (I. ,  I_,  ....  I  )  (S.,  ; . . .  :S  ) .   Then  we  treat  m  body 

12    '  n   1      m  J 

statements  as  one  statement  S  where 


m 


0UT(S((i1,i2,...,in)))  =  U  OUT(S  (dri2,  .••,!))) 

P=l 


and 
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m 
IN(S((iri2,...,in)))  =  IN(S1((i1,i2,...,in)))  U  U  [IN(S  ((i^ig, 

p-1 
...,in)))  -  U  OUTCS^dpig,...,^)))]. 

Having  these  two  sets,  we  can  use  results  of  Section  6.3  directly.  For  example 
let  us  consider  Theorem  2.  Then  we  may  modify  Theorem  2  as  follows.  First 

suppose  an  array  A£  appears  in  OUT(S( (i(l. .n) ) ) )  and  A^  appears  in 

IN(3((i(l..n)))).   Then  obtain  R(k,p,q). 

Theorem  6;   (cf.  Theorem  2) 

For  each  A^  in  0UT(S( (i(l. .n) ) ) ),  we  obtain  R(k,p,  q)  for  all  q  such 

that  A^  is  in  IN(S( (i(l. .n) ) ) ).   Then  if  there  is  any  R(k,p, q)  which  satisfies 

all  three  conditions  described  below,  then  S  cannot  be  computed  in  type  1 
parallel  w.r.t.  I  .   Conditions: 

(1)  R.(k,p,q)  =  0  or  0  for  all  j  =  1,2,  ...,u-l, 

J 

(2)  Ru(k,p,q)  >  0,  and 

(3)  there  is  £(u+l<£<n)  such  that  V  .(u+l<j<£-l),  R  .(k,p,  q)  =  0  and 

J  J 

R/k,p,q)  =  0  or  R£(k,p,q)  <  0. 

The  above  theorem  can  be  proved  similarly  to  Theorem  2,  and  the 
details  will  not  be  given. 

Since  we  have  to  apply  the  above  check  for  all  R(k,p,q)  vectors,  the 
number  of  checks  required  is  proportioned  to  the  total  number  of  R(k,p, q)  vectors, 
#R(k,p, q).  Also  since  we  can  try  to  compute  S  in  type  1  parallel  in  n  ways,  i.e. 
with  respect  to  I  (u=l,2, . . .,n),  the  total  number  of  checks  we  would  perform  is 
given  by  n  x  (#R(k,p,  q)). 
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6. J. 3  Separation  of  a  Loop 
6.3-3-1  Introduction 

In  this  section  we  study  replacement  (or  separation)  of  a  single  for 
statement  with  two  or  more  for  statements.  Let 

7®  =  (l1,I2,...,In)(S1;S2;...;Sm) 

=  (l1,...,Iu_1){(lu,...,In)(S1;...;Sm)] 

=  (I,,..., I  JF*. 

1'   '  u-1  u 

For  fixed  values  of  I,,...,  I  , ,  let  us  consider  P^. 

1'    '  u-1'  u 


Figure  6.10.  Execution  of  F^ 

u 


Yjk 


If  we  write  down  the  primitive  execution  order  E0(p),  it  can  be  represented 

by  the  straight  line  as  shown  in  Figure  6.10.  Now  let  us  consider  a  statement 
S((i(l. .n), q) )  (See  Figure  6.10).   If  it  does  not  receive  any  computed  results 
from  part  A,  i.e.  if  there  is  no  id  and  a  statement  S(  (i(l.  .u-l),i'(u.  .n),p)) 
for  which 

(id,(i(l..u-l),i'(u..n),p),(i(l..n),q))eDR 
and  (i'(u..n),p)  <  (i(u..n),q) 
hold,  then  it  may  be  computed  independently  of  part  A.   If  this  holds  for  all 
p  and  all  (i(u..n))  for  some  fixed  q,  then  we  can  compute  S  before  any  S  ,  i.e. 

M 

P111  ~  (I  ,...,1  )(S  );(I  ,...,1  )(S.;...;S   -S  .;...;S  ). 
u     u      n   q    u'     n   1'    '  q-1'  q+1'    '  m7 

Similarly  if  S( (i(l. .n), q) )  does  not  give  any  output  to  part  B,  then  we  have 

M 

P111  "  (I  ,...,1   )(S.;...;S  ,;S  _;...;S  );(l  ,...,1   )(S  ). 
u   v  u'    '    n   1'     q-1'  q+1'    '  m  '  u*    '  n'      q_' 


6. 3« 3*2  The  Ordering  Relation  (6   )  and  Separation  of  a  Loop 

Now  we  study  how  P  may  be  replaced  with  several  for  statements.  We 

first  define  the  relation  6     between  body  statements.   The  relation  is  such 

u 

that  if  B   (p, q)  holds,  then  for  any  given  (i(u..n))  there  are  (i'(u..n))  and  id 

such  that 

(id,  (i(l..u-l),i'(u..n),p),(i(l..n),q))eDR 
and  (i*(u..n))  <  (i(u..n)) 
hold.   That  is,  for  some  fixed  q,  if  there  is  no  p  for  which  B   (p,  q)  holds,  then 


u 


S  can  be  computed  before  any  S  and 


175 


where 


P^-(l)(Sq);(l)(S1;...;Sq_1;Sq+1;...;Sm) 


(I)  =  (I  ,...,1    ). 


Definition  2: 


Between  two  body  statements,  S  and  S  we  first  obtain  R(k,p,  q).  Then 
9   (p,  q)  holds  if  and  only  if 

(1)   R.(k,p,  q)  =  0  or  0  for  all  J  =  1,2,  ...,u-l. 

J 

and  (2)  the  first  nonzero  element  of  R"(u,k,  p,  q)  is  either  0   or  a  positive 
number.   Also  if  all  elements  of  R"(u,k,p,  q)  are  0  then  p  <  q 
holds. 
We  also  write  9     =   { (p, q)  |  9   (p, q)  holds}. 

If  A,  appears  more  than  twice  in  S  ,  then  we  modify  Definition  2  as 

follows.   Suppose  A,  appears  twice  in  S  as  q,  -th  and  q^-th  occurrences.   Then 

we  construct  two  vectors  R(k,p, q, )  and  R(k,p,q_).   For  each  vector  we  check  the 

above  two  conditions.   If  at  least  one  of  two  vectors  satisfies  the  two 
conditions,  then  we  let  9   (p,  q)  hold. 


u 


Example  6 ; 


SI:  A^-1,1,,+3]    :=A2[I2,I3]    +  A^[  1^1] ; 

S2:  A2[I2+1,I3-1]    :=  A1[I1,I2-3]    +^[1^1^; 

S3:  ^[1^5,1^-1]    :-  A^Lylj]    +  Aj;  I^^+l] ; 

S^:  A^LJ    :  =  A^I^l]    +  A^I^I^]; 
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give 

R(l,l,2)  =  (-1,6,0), R(l,.l,3)  =  (-1,2,0),  . 

R(2,2,l)  =  (0,1,-1),  R(2,2,*0  =  (0,1,-1), 

R(3,3,2)  =  (-5,0,-1)  andR(^Al)  =  (1,0,0). 
Then  we  have  0  =  { (2,1),  (2,k),  (k-,1)) . 

Now  let  us  study  Definition  2  in  detail.  First  we  note  that  id  in 
Figure  6.10  corresponds  to  those  A,  for  which 

F(k,p,j)(i  )  =  F(k,q,j)(i  ) 

J  J 

holds  for  all  j  =  1,2, . ..,u-l  and 

F(k,p,j)(i  •)  =  F(k,q,j)(i  ) 

holds  for  all  j  =  u,u+l, ...,n.  The  former  implies  that 

F(k,p,j)  =   F(k,q,j) 
or 

R,(k,p,q)  =  0  (or  0) 

J 

for  j  -   1, 2, ...,u-l  which  is  equivalent  to  the  first  condition  of  Definition  2. 
Next  note  that  (i'(u..n),p)  <  (i(u..n),q)  implies  that  either 
(i)   (i'(u..n))  <  (i(u..n)) 
or  (ii)   (i'(u..n))  =  (i(u..n))  and  p  <  q. 
In  the  first  case 

(F(k,hP,u),...,F(k,hP,n))  >  (F(k,hq,u),...,F(k,hq,n)) 
must  hold-   In  the  second  case 

(F(k,hP,u),...,F(k,hP,n))  =  (F(k,hq,u),...,F(k,hq,n)) 
must  hold.   These  two  make  up  the  second  condition. 
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From  0     we  can  construct  a  dependence  graph  D  with  m  nodes  each  of 
which  represents  a  body  statement,  e.g.  from  Example  6  we  get: 


In  Du  we  call  a  series  of  ©u,  ^(p^Pg),  ©u(p2,P5),  . .  .^(p^p^),  . . ., 

©  (pk  TtV^)   a  chain  and  write  ch(p., p,  )  for  it.  If  p,  =  p..  then  it  is  called  a 

mesh  M.  We  say  that  anode  p.  is  in  the  chain  ch(p  ,p  )  (or  in  the  mesh  M),  or 

the  chain  ch(p, ,p,  )  (or  the  mesh  M)  includes  p..   Note  that  for  nodes  p  and  q 

there  may  be  more  than  one  chain  which  connects  p  to  q. 
Now  let 
Z  =  (p  |  there  is  no  q  such  that  6   (q, p)  holds) 

and 

Z  =  (p  |  there  is  no  q  such  that  6   (p, q)  holds}. 

Furthermore  let 

PD(p)  =  {q  |  ch(q,p)  exists}  U  (p} 
and 

SC(p)  -  (q  |  ch(p,q)  exists}  U  {p}  • 

(PD  for  predecessors  and  SC  for  successors).   Then  we  classify  nodes  in  D  as 
_  u 

follows : 

Z3_  =  (P  |  For  all  r  e  PD(p),  there  is  no  mesh  in  D  which  includes  r} , 

Z  =  {p  I  For  all  r  e  SC(p),  there  is  no  mesh  in  D  which  includes  r)  , 
j  u 

and 
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Z2  =  N  -  Z1   -  Z3- 

Let  Z1(or  Z  )  =  (p^Pg,  .  ..,p  }  •  Then  we  can  order  this  set  as  p  '  p  ',...,  p  ' 

M 

in  such  a  way  that  6   (p.', p.')  does  not  hold  if  i  >  j.      Let  us  write 

9  9 

Z.(or  Z,)  for  a  resultant  ordered  set.  Also  we  order  Z.  =  {q.,q_, . .  .,a  }  as 

Q 

L2   =  ^i''^'*  '••>CV  )  in  such  a  way  that  ^i'  <  qi'  if  i  <  J- 
Now  given  a  loop  t     where 

F    =    (I^,Ig,  . . .,  In) (S^jSgj . . . ;S    ; 

■  (ir---,iu.1)UV.-.,in)(s1;S2;...;sm)) 
=  (in,...,i    J?*, 

1  u-l     u 

we  build  the  dependence  graph  D     and  obtain  sets   Z  ,  Z     and  Z,,    say  Z.    = 

{P1*P2>'"*PU^    z2  =    ( <!■]_»  12' •'•' ^    andZ3  ■   tri>r2>  "^rw^  "      (m=u+v+w)- 

6      6       6 

From  Z  and  Z,  we  obtain  ordered  sets  Z  and  Z,,  say  Z  =  (p  ',pp', . . .  ,p  ') 

6  6 

and  Z^  =  (r1l,P2,,...,Pw')-   A1so  we  have  Zg  =  (q^  ,  q^*,  . . .,  q^'  ) .   Then 

M 
p^~  (i)(s  ,);(i)(s  ,);...;I(S  ,);(i)(s  ,;...;S  ,); 

^1        y2  *u        41       ^v 

(l)(Sr  ,);(I)(S  ,);...j(l)(S  ,) 

12  w 

where  (i)  =  (I  ,1  .,...,1  ). 
u'  u+1     n 


Note  that  Z  (or  Z,)  together  with  9  makes  a  graph  which  does  not 
contain  any  mesh.  To  order  Z  (or  Z,)  the  technique  discussed  in 
Chapter  7  niay  be  used. 
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Thus  we  have  replaced  a  loop  P  with  as  many  for  statements  as  possible.  We 
say  that  p  is  separable  from  !F  if  peZ  (or  peZ_)  with  respect  to  u,  and  that 
F    is  separable  with  respect  to  I  .  Also  we  say  that  p  is  separated  with 
respect  to  I  if  P°J  is  replaced  by  many  for  statements  as  we  showed  above. 

6.3-3.3  Temporary  Storage 

Now  let  us  study  the  following : 

2 
P  :  for  I   :=  1  step  1  until  1  do 

for  I   :  =  1  step  1  until  kO   do 

begin 

SI:  A^y  :=  Ag[I2]  +  A^Ig]; 

S2:  Ak[I2]    :=  A.J  Ij  +A^[Ig]; 

end 
Then  we  have  R(l,l,2)  =  (0,0)  or  ©  (1,2)  holds  and: 

V 

© <D 

Hence  we  get  Z,  =  (1,2)  and 

P1*2:   (I1,I2)(S1:  A^IJ  :=  A2[I2]  +  A3[I2]); 
(I1,I2)(S2:  A4[I2]  :=  A^IJ  +A^[I2]). 
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This,   however,    does  not  give  the  same  results  as  produced  by  the  original  P^. 
Note  that  after  the  execution  of  the  first  loop  (I,, I_)(Sl),    the  only  outcome 

is  A[IX]    (i.e.  A[l])  =  A2[>0]   +  AJlfO] .     However  the  second  loop,    (I  ,1  p)(S2), 

requires  forty  different  inputs,    i.e.  Ap[l]    +  A,[l], .  ,,A,JkO]   +  A,[^0].     Hence 

2 
it  becomes  necessary  to  modify  P  '      as  follows: 

Pl2;     (\^2^S1:     ^l'V    :=  A2[I2]    +  A3[I2] ) ; 

(I1,I2)(S2:     AJI2]    :=  AJI^y    ♦  A^]). 


>A[1] 

)A[1] 


Figure  6.11.  An  Introduction  of  Temporary  Locations 


In  general  we  apply  the  following  transformation  rule  on  a  loop  when 
it  is  separated.  Assume  that  S  and  S  are  body  statements  in  a  loop  F,    and 

6   (p,  q)  holds.  Further  assume  that  p  is  separated  from  TT  with  respect  to 
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I  (i.e.  peZ.,qeZ  ).  Now  let  us  consider  the  vector  R"(u, k,p, q) .  Let  the 

value  of  the  first  element  which  is  neither  0  nor  0  be  e  and  its  position  he 
i.  Then  we  let 

R(u,k,p,q)  =  (J  |  u  <  j  <  i  and  R,(k,p,  q)  =  0} . 

We  order  elements  of  R(u,k,p,  q)  by  their  positions  in  R"(u,k,p, q)  and  write 
R(u,k,p,q)  =  (r(l),r(2),...,r(t)). 

Then  we  apply  the  following  on  the  loop  F . 

Transformation  Tp : 

Transformation  T  is  defined  for  the  cases  e  <  0  and  e  >  0 
separately. 

(1)  e  >  0. 

Change  F(k,p,r(j))  and  F(k,q,r(j))  to  Iy/j\  for  j  =  1,2,  .  ..,t. 

(2)  e  <  0. 

(i)  Change  F(k,p,r(j))  to  1,/j)  for  j  =  1,  ...,t. 

(ii)  Change  F(k,  q,r(j))  to  the  following  ALGOL  program  for 


j  -  l,...,t.   -If  (Ir(.+1)  =  1)  and  (lr(.+2)  =  1)  and  ... 
(Ir(t)  =  1)  then  (if  ir( .}  =  1  then  |Lr(j)|  else  Ir(j)  -  1) 
else  Ir(i)'"  Also  change  F(k,hq,  r(t) )  to  the  following  ALGOL 

program:  "if  (I  ,  ,  =  l)  then  |L  ,,  J  else  I  ,. %  -  1." 
—   r(t;  '  r(t)1  r(t) 


Example  7 • 


Let  R"(5,k,P,q)  =  (R5(k,p,q),...,R9(k,p,q))  =  (0,0,0,0,-1). 
Then  we  get  e  =  -1  and  l   =  9  and  R(5,k,p,q)  =  (6,8).  Also  assume  that  \lA   = 
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|Lq|  =  3«   Originally  S  and  S  may  look  like 

y  WyW  :=fp(--); 

Sq:   ..  :=  f  (...^[I^I^^yi],...); 
Now  after  Transformation  T^  is  applied,  S  and  S  become: 

y:  w^'VW  :=f  (--); 

Sq':   "  :=  tq^'"fAT/SIl'  I5>B6>I7>Bg,I9],...), 
where 

Bg  =  if  Ig  =  1  then  (if  Ig  =  1  then  3  else  Ig-l)  else  I, 

and  Bq  =  if  In  =  1  then  3  else  I«  -  1. 

Note  that  by  applying  Transformation  T?,  temporary  locations  are 
eventually  introduced.   For  example  in  Example  7,  A,  is  changed  to  a  seven- 
dimensional  array  from  a  four  dimensional  array  by  Transformation  T  . 

6.3.k     Parallelism  Between  Body  Statements 

6-3-i+-l  Introduction 

Now  we  describe  the  parallelism  between  body  statements.  As  stated 
before  it  becomes  necessary  to  modify  index  expressions.   In  this  section  we 
give  an  algorithm  which  modifies  index  expressions  properly. 

We  first  describe  the  algorithm  in  terms  of  a  restricted  loop  with 

only  one  loop  index,  i.e.  F^  =  (i  ) (S  ;S  ; . . . ;S  ).  Accordingly  every  array 
identifier  in  P™  is  of  a  form  A.[F(k,h,  1)]  (=  A.  [I,+w(k,h,  l)] )  where  this  is 

the  h-th  occurrence  of  A  in  F^.   For  convenience  we  drop  the  subscript  of 
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loop  index.   The  primitive  execution  order  for  p  becomes 

EQ(Prn)  =  {((i,p),V((i,p)|(|L|,m))|(l,l)  <  (i,p)  <  (|L|,m)}. 

For  a  given  loop  p  ,  we  consider  the  I-S  plane  which  is  an  L  by 
m  grid.  For  example  we  have  the  following  1+0x3  grid  for: 
for  I  :=  1  step  1  until  kO   do 
begin 
81:  A^I-1]  :=  A2[I]  +  A^I-l]; 

S2:  A^I]  :=  A^I+1]  -  A^I]; 

S3:  A3[I]  :=  A^I]  +  Ag[I]; 


end. 


On 


this  grid,  we  only  show  the  relation  DR,  e.g.  (A,[i-1],  (i-l,3)>  (i>l))  e  DR, 


SI 

S2 

S3 

# 

• 

i-1 

i 

V 

.St. 

-+ 

i+1 

• 

The  direction  of 
wave  front  travel 


Figure  6.12.  Wave  Front  for  Simultaneous  Execution  of  Body  Statements 
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Then  the  objective  of  this  section  is  to  discover  a  wave  front  W  (cf.  Section 
6.2.4)  which  separates  all  inputs  from  the  computation,  e.g.  in  Figure  6.12 
inputs  (shown  by  0)  to  S(  (i,l)),  S(  (i+1,2))  and  S((i,3))  (shown  by  t)  lie  above 
the  wave  front  indicated  by  a  dotted  line.  Hence  S((i,  l)),S(  (i+1,2))  and 
S((i,3))  can  be  computed  simultaneously  while  I  takes  values  1,2,  ...,40 
sequentially.   In  general  to  discover  a  wave  front  is  equivalent  to  discovering 
a  constant  C(p)  for  each  body  statement  S  so  that  all  statements  S((i-C(l),l) ), 

.. .,S((i-C (p),p )),..., S((i-C(m),m))  can  be  computed  simultaneously. 

6.3.4.2  The  Statement  Dependence  Graph  and  the  Algorithm 

Let  us  consider  the  I-S  plane  again  and  consider  the  computation 

S((i,p))-   Assume  that  there  is  id  such  that 
(id,(j,q),(i,p))eDR 

where  either  (i)  j  =  i  and  q  <  p,  or  (ii)  j  <  i  and  p  /  q,  then  clearly  S((i,p)) 

and  S((j,q))  cannot  be  computed  simultaneously. 

Definition  3' 

The  statement  dependence  graph  (cf .  the  dependence  graph  in  Section 

6.3-3),  D(p  ),  is  defined  by  a  set  N  of  nodes  1, 2,  .  ..,m  each  of  which 

corresponds  to  a  body  statement  of  P  and  the  arrow  relation  a.  From  node  p 
to  q  there  is  an  arrow  a(p, q)  if  and  only  if  either  one  of  the  following  two 
conditions  hold. 

(l)  For  fixed  i,  there  is  k  such  that 

AJJ[F(k,h,l)(i)]  €  OUT(S((i,p))), 
A*[F(k,g,l)(i)]  e  IN(S((i,q))), 


185 


F(k,h,l)(i)  =  F(k,g,l)(i) 
and  p  <  q. 

(2)  For  fixed  i,  there  exist  k  and  i'  such  that 
A£[F(k,h,l)(i')l  €  OUT(S((i',p))), 

A*[F(k,g,l)(i)]  €  IN(S((i,q))), 

F(k,h,l)(i«)  =  F(k,g,l)(i) 
and  i"  <  i. 

In  the  first  case  we  label  the  arrow  0  and  write  f(p,  q)  =  0.   In  the  second 
case  the  arrow  is  labeled  1  and  we  write  f(p, q)  =  1. 

The  statement  dependence  graph  for  the  previous  example  is: 


©  © — °-*6) 


A  chain  of  arrows,  a(prp2),  a(p2,p^),  . . .,  a(pk_1,pk),a(pk,p1)  in 

D(P  )  is  called  a  mesh  M  and  we  say  e.g.  a(p.,p.   )  is  in  M.   If  i(p.,p.   )  =  0 

for  some  arrow  in  M,  then  M  is  called  a  part  zero  mesh.   The  following  lemma 
is  obtained  immediately. 

Lemma  2 : 

If  D(P  )  contains  a  part  zero  mesh,  then  there  is  no  wave  front  for 

P*. 

Henceforth  we  assume  that  D(F  )  has  no  part  zero  mesh.  Given  D(FJ, 
we  define  a  subset  Z  of  N  as  follows: 

Z  =  (p   there  is  no  q  such  that  i(p, q)  =  0  or  f(q,p)  =  0} . 

Z  together  with  arrows  gives  a  subgraph  Dg  of  D(Fm).  Further  we  let 
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Zy.  =  (P   peZ  and  there  is  no  q  such  that  i(q,  p)  =  0}  . 
Now  we  give  an  algorithm  to  find  a  wave  front  for  D(Pm). 

Algorithm  1; 

(1)  Let  C(p)  =  +  oo  for  all  p  e  N. 

(2)  (i)  Take  any  p  from  ZL.   If  Z.  =  0,  then  go  to  Step  (5). 

(ii)   Let  C(p)  =  0. 

(3)  (i)   If  there  are  nodes  s  and  t  such  that  a(s,t)  exists,  £(s,t)  =  1  and 

C's)  >  C(t),  then  we  let  the  value  of  C(s)  be  equal  to  C(t). 
(ii)  If  there  are  nodes  s  and  t  such  that  a(s,t)  exists,  f(s,t)  =  0  and 
C(s)  >  C(t),  then  let  the  value  of  C(s)  he  equal  to  C(t)  -  1. 

Repeat  (i)  and  (ii)  until  there  are  no  s  and  t  which  satisfy  (i)  or  (ii)  in 

D(Fm). 

(h)      (i)  If  for  all  p  in  Z  C(p)  ^  +  »,  then  go  to  Step  (5). 

'Otherwise  take  any  p  from  Z  for  which  there  is  q  in  Z  such  that 

a(p,q)  exists  and  C(q)  /  +  »•   Let  M  =  max  (C(s)]  where  s  6  Z  and 

C(s)  ^  +  oo.   Then  let  C(p)  =  max  (c(q)  +  1,M}  .  Go  to  Step  (3). 

For  all  p  in  Z  with  C(p)  =  +  oo,  let  C(p)  =  M  where  M  =  max  (C(s)}  and 

seZ 

c(s)  ^  +  co.   If  Z      ■    $,    then  let  C(p)  =  0  for  all  p  in  Z. 


Example  ( 
(1)  D: 
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(2)  Z  =  fl,2,3,U,6,7,8l, 


KD-J^©         ©-^<T>^(D 


(3)  Let  C(l)  =  0  and  apply  Step  (3)  of  Algorithm  1.   Then  we  get  C(l)  =  0 

since  there  is  no  q  such  that  a(q, l)  exists. 
(k)     Let  C(2)  =  1. 

(5)  Let  C(3)  =  1. 

(6)  Let  C(*0  =  2.   Then  we  apply  Step  (3)  of  Algorithm  1  on  a(S,l+),  a(8,  5), 
a(5,2),a(2,l),a(7,8),a(6,7),a(9,6),a(3,9),a(3,l)  and  a(9,l)«  And  we  get 
C(5)  =  C(8)  =  2.   C(2)  =  1.   C(7)  =  1,  C(6)  =  0,  C(9)  =  0,  C(3)  =  0,  and 
C(l)  =  -1. 

(7)  There  is  no  p  in  Z  with  C(p)  =  +  ».   Hence  Algorithm  1  terminates  and 
we  get 

C(l)  =  -1,    C(2)  =  1,  C(3)  =  0,  COO  =  2,  C(5)  =  2,  C(6)  =  0,  C(7)  =  1, 
C(8)  =  2  and  c(9)  =  0. 
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Figure  6.13-   A  Wave  Front  for  Example  10 


188 


Now  we  show  that  Algorithm  1  gives  a  valid  wave  front.   To  prove  this  first 
we  show  that  Algorithm  1  is  effective,  i.e.  every  step  of  Algorithm  1  is  always 
applicable  and  terminates. 

Lemma  ~>: 

Algorithm  1  is  effective. 


Proof: 

That  Step  (l),(2),(k)   and  (5)  are  effective  is  clear.   Now  we  show  that 
Step  (3)  is  effective.   First  we  define  U(p)  to  be  a  set  of  nodes  such  that 
U(p)  =  (q  I  there  is  a  chain  of  arrows  a(p,  ,p  ),a(p  ,p_),  . . ., 

a(pn_x,Pn)  exist  where  j>±  =   q  and  pn  =  p)  U  {p},  e.g. 

3  6 

© •Ch KpyA^ »k: 


D(P8): 


and  U(5)  =  (1>2, h,  5>7>9) .   U(p)  =  U(q)  implies  that  there  is  a  mesh  which 
includes  p  and  q.   By  assumption  this  mesh  is  not  a  part  zero  mesh  and  c(q) 
will  be  assigned  the  same  value  as  C(p)  in  a  finite  number  of  steps  after  c(p) 
has  been  assigned  a  value. 

If  U(p) 3  U(q),  then  c(q)  will  be  assigned  a  value  less  than  or  equal  to 
C(p)  in  a  finite  number  of  steps  after  C(p)  has  been  assigned  a  value. 

Thus  after  a  finite  number  of  applications  Step  (3)  eventually  terminates. 
Hence  Algorithm  1  is  effective. 

(Q.E.D.) 
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Theorem  6: 

Algorithm  1  gives  a  valid  wave  front .  ■ 

Proof: 

To  prove  this,  it  is  enough  to  show  that'  (i)  if  f(p,  q)  =  0,  then 
C(p)  <  C(q)  and  (ii)  if  i(p,q)  =  1,  then  C(p)  <  C(q).  However,  from  Steps  (3) 
and  (k)   of  Algorithm  1,  clearly  the  above  conditions  hold.  Also  if  p  is 
assigned  a  value  C(p)  by  Step  (5)  it  implies  that  either  (i)  there  is  r  e  U(q) 
where  q  e  Z,  and  f(r,  q)  =  1  or  (ii)  there  is  no  such  r.   In  the  second  case 

C(p)  may  take  any  value  (S  may  be  computed  at  any  time),  and  in  the  first  case 

C(q)  >  C(r)  must  hold.   Hence  we  let  C(q)  =  max{C(s)} . 

(Q.E.D.) 
To  handle  a  restricted  loop  with  more  than  one  loop  indicies,  we 
modify  Definition  3  as  follows.   For  each  S  and  S  in  P  we  first  obtain  a 

vector  R(k,p, q). 

Definition  3'  '> 

The  statement  dependence  graph  of  tf    D(f  ),  is  defined  by  a  set  N  of 
nodes  1, 2, ...,m  each  of  which  corresponds  to  a  body  statement  of  P^  and  the 
arrow  relation  a.   There  is  an  arrow  a(p, q)  if  and  only  if  either  one  of  the 
following s  holds : 

(1)  R.(k,p,  q)  --  0  or  0  for  all  j  =  1, 2,  ..,n  and  p  <  q.  We  let  i(p,  q)  =  0. 

(2)  V((R1(k,p,q),...,Rn(k,p,q)|B)  >  V(  (0,  .  . . ,  0)  |  B)  where  B  =  (  |Lj, . . .,  |Lj  ). 
We  let  i(p, q)  =  1. 
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From  Definition  3'  clearly 

(1)  !(p,q)  =  0  if  S  ((i-^ig,  ...,1^))  US6S  the  outPut  of  Sp((i1;i2>"-.»i  ))  and 
p  <  q. 

(2)  f(p,q)  =  1  if  S  ((i^ig,  •••»in))  uses  the  output  of  S  ((^^ig',...,!  •)) 

where  (i^ig', . .  .,1^  )  <  (i^ig', . .  ,,in). 

Algorithm  1  is  then  applied  on  D(Fj.   For  example  let  P^  be 
SI:  A^Ig^g]  :=  A2[I2]  +  A3[I2-1,I3+1]; 

S2:  Ag[I2+l]  :=  A^I^Ig]  +  k^I^I^i 

S3:  A3[I2,I3]  :=  Ag[Ig-l]; 

Then  we  have 
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6.3*5  Discussion 

Given  a  loop  P  =  (I  ,  .  . .,  I  )  (S,  ; . . .  ;£L  ),  we  first  try  to  execute  body 
statements  in  parallel  with  respect  to  some  loop  index.   If  this  fails  for  any 
loop  index,  or  if  this  does  not  give  a  satisfactory  result,  then  we  try  to 
replace  the  loop  with  many  for  statements.   Then  we  can  attempt  to  execute  a 
body  statement  (or  body  statements)  of  a  resultant  for  statement  in  parallel 
with  respect  to  some  loop  index.   If  this  fails,  then  we  may  try  the  third 
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approach,  i.e.  we  try  to  execute  all  body  statements  simultaneously  while 
loop  indices  vary  sequentially.  Often  the  number  of  loop  indiciea,  n,  is 
very  small  (typically  n  =  2),    and  it  will  be  easy  to  try  all  variations. 
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7-   EQUALLY  WEIGHTED— TWO  PROCESSOR  SCHEDULING  PROBLEM 

T'l  Introduction 

This  chapter  gives  a  solution  to  the  so-called  equally  weighted- 
two  processor  scheduling  problem.   Informally  the  problem  may  be  stated  as 
follows.   Given  a  set  of  tasks  along  with  a  set  of  operational  precedence 
relationships  that  exist  between  certain  of  these  tasks,  and  given  two  identical 
processors  (PE),P(2),  how  does  one  schedule  these  tasks  on  the  two  processors 
so  that  they  execute  in  the  minimum  time?  It  is  assumed  that  either  one  of 
two  processors  is  capable  of  processing  any  task  in  the  same  amount  of  time, 
say  1  unit  of  time.   Informally  a  set  of  tasks  together  with  procedence  relations 
forms  a  graph. 

Clearly  the  problem  of  scheduling  any  given  equally  weighted  task 
graph  on  k  identical  processors,  P(k),  in  an  optimal  way  is  effectively 
solvable  by  exhaustion.   But  this  is  far  from  possible  in  practice.   The  only 
practical  solution  so  far  obtained  is  a  result  for  scheduling  a  rooted  tree 
(a  restricted  class  of  graphs)  with  equally  weighted  tasks  on  k  identical 
processors,  P(k)  [21]. 

Now  let  us  study  how  the  equally  weighted— two  processor  scheduling 
problem  is  related  to  the  computation  of  arithmetic  expressions  on  a  parallel 
machine. 

In  Chapter  3,  the  parallel  computation  of  an  arithmetic  expression  by 
building  a  syntactic  tree  was  studied.   There  we  were  only  concerned  with  the 
height  of  a  tree  and  reducing  it  by  distribution,  and  we  did  not  introduce  any 
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physical  restrictions  of  a  machine.  For  example,  in  reality,  the  size  of  a 
machine,  i.e.  the  number  of  PE's  is  limited  rather  than  arbitrarily  big.   One 
problem  which  will  arise  immediately  is  whether  the  distribution  algorithm 
should  be  applied  or  not  to  reduce  tree  height  since  distribution  introduces 
additional  operations.   For  example  assume  that  we  have  a  two  PE  machine,  P(2). 
Now  let  us  consider  two  arithmetic  expressions,  A  =  a(bc+d)  +  e  and  B  = 

abc(defgh+i).   Then  we  have  h[A]  =  k,    h[B]  =  5,  h[A  ]  =  h[abc+ad+e]  =  3  and 

h[B  ]  =  h[abcdefgh+abci]  =  k.      Thus  distribution  reduces  the  height  of  T[A]  and 

T[3].   On  the  other  hand  Figure  7-1  shows  that  if  A,  B,A  and  B  are  computed 

on  P(2),  A  is  still  computed  in  less  time  than  A  while  B  now  takes  more  time 

than  B. 

Assume  that  we  get  A  from  A  by  the  distribution  algorithm.   If  the 
size  of  a  machine  is  limited,  then  it  may  not  necessarily  be  true  that  A  can 

be  computed  in  less  time  than  A  even  if  h[A  ]  <  h[A]  holds.   Actually  it  is  a 
nontrivial  problem  to  decide  whether  distribution  is  to  be  made  or  not  to 
reduce  computation  time  (which  is  different  from  tree  height)  if  the  size  of 
a  machine  is  limited.   It  depends  on  the  form  of  an  arithmetic  expression  as 
well  as  the  machine  organization.   We  will  not  go  into  this  problem  any 
further. 

Now  let  us  look  at  the  situation  from  a  different  point  of  view. 
Given  an  arithmetic  expression  A  and  its  minimum  height  tree,  it  is  possible 
to  take  advantage  of  common  expressions  to  reduce  the  number  of  operations  to 
be  performed  in  hopes  of  reducing  computation  time.   For  example  let  us 
consider  the  computation  of  A  =  (a+b+c+d)ef  +  (a+b)g  on  P(2).  If  we  evaluate 
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(a+b)  only  once  then  A  can  be  computed  in  k   steps  on  P(2)  while  if  (a+b)  is 
evaluated  twice,  then  it  takes  5  steps  to  compute  A  (see  Figure  7*2). 


e  b   c 


cab     a  d  e 


(a)  A  (k   steps) 


(b)  A  (3  steps) 


i   h  d  e    f 


b   c 


c    1 


(c)   B  (5  steps) 


(d)  B   (6  steps) 


Figure  7-1.      Computation  of  Nondistributed  and 
Distributed  Arithmetic  Expressions 
on  P(2) 


Our  main  concern  in  Chapter  2  was  to  reduce  tree  height  assuming 
that  the  size  of  a  machine  is  unlimited.  Hence  we  were  not  interested  in 
reducing  the  number  of  operations.  As  mentioned  there,  it  was  an  open 
problem  to  find  out  common  expressions  while  keeping  the  height  of  a  tree 
minimum.   However,  if  we  could  take  advantage  of  common  expressions  while 
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level 


(a)  A  Minimum  Height  Tree  for  A 


5 
k 

3 
2 

1 

Step 


(b)   (a+b)  computed  twice 


(c)   (a+b)  computed  once 


Figure  T«2.   Common  Expression 


keeping  the  height  of  a  tree  minimum,  then  we  would  obtain  a  graph  of  operations 
rather  than  a  tree  for  an  arithmetic  expression  (see  Figure  7-2.  (c)). 

While  we  do  not  know  how  to  compute  an  arithmetic  expression  A  on 
P(2)  in  the  minimum  time  (e.g.  should  distribution  be  done?),  the  scheduling 
algorithm  presented  in  this  chapter  schedules  a  given  graph  of  operations  for 
an  arithmetic  expression  on  P(2)  so  that  the  given  graph  is  processed  in  the 
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minimum  time,  assuming  that  each  FE  of  P(2)  may  perform  addition  or  multipli- 
cation independently  but  in  the  same  amount  of  time,  say  1  unit  of  time.  Note 
that  we  may  be  able  to  construct  many  graphs  for  A.  Hence  while  the  scheduling 
algorithm  schedules  a  given  graph  for  A  on  P(2)  in  an  optimal  way,  the  algorithm 
does  not  necessarily  compute  A  itself  in  the  minimum  amount  of  time. 

7-2  Job  Graph 

Let  G  be  an  acyclic  graph  with  nodes  N.  (i=l,2, . . .,n)  and  a  set  of 

directed  arrows  connecting  pairs  of  nodes.   For  nodes  N  and  N'  we  write  N  -*  N' 
if  there  is  an  arrow  from  N  to  N' .  We  say  that  N  is  an  immediate  predecessor 
of  N'  and  N'  is  an  immediate  successor  of  N.  Also  we  let 

SR(N)  =  {N1  |N  -*  N'l  (a  set  of  successors  of  N) 
and  PR(N)  =  {N' |N*  -  Nl  (a  set  of  predecessors  of  N). 

Nodes  which  have  no  incoming  arrows  are  called  initial  nodes,  and 
nodes  which  have  no  outgoing  arrows  are  called  terminal  nodes.   For  the  sake 
of  simplicity  we  assume  that  a  graph  has  one  initial  node  and  one  terminal 
node.   If  there  are  more  than  two,  then  we  can  add  a  dummy  initial/ terminal 
node.   We  write  N  and  N  for  them,  respectively.  We  also  write  N  =5>  N'  if 

there  is  a  chain  N,  ,N_,  . . .  ,N  such  that  N  -»1  ■♦  ...  -»M  ->  N' ,  or  N  -*  N' . 
12m  1         m 

Furthermore  we  write  N  /  N'  or  N  ^>   N'  to  show  that  the  relation  N  -»  N'  or 

N»  I'  does  not  hold. 

Definition  1: 

The  forward  distance  (or  level)  from  the  initial  node  to  a  node  N, 
d  (N),  is  the  length  of  the  longest  path  from  the  initial  node  to  N,  thus 
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d_(NT)  =  0.   Similarly  the  backward  distance  from  the  terminal  node  to  N,  d  (n)j 


is  defined,  thus  •!.  (lO  =  0. 

Thus  a  node  N  cannot  be  initiated  before  time  cL(n)  but  may  be 
initiated  at  cL.(N)  or  at  any  time  after  that. 

Definition  2; 

The  height  of  a  graph  G,  h(G),  is  defined  as 
h(G)  =  d^V- 

Then  we  say  that  a  graph  G  is  tight  if  for  all  nodes  N, 
cL^N)  +  dT(N)  =  h(G). 

Otherwise  we  say  that  a  graph  G  is  loose. 
Example  1 ; 


Figure  T>3-   A  Loose  Graph  and  a  Tight  Graph 


Th 


e  graph  G,  is  a  loose  graph  because  d  (N  )  +  d_(Np)  =  2  ^  h(G. ) 


whereas  the  graph  Gp  is  a  tight  graph. 
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First  we  shall  study  an  optimum  scheduling  for  a  tight  graph.  A 
scheduling  for  a  loose  graph  will  be  discussed  in  a  latter  section.   In  what 
follows,  we  use  words  "process"  and  "schedule"  interchangably. 

Definition  3: 

Let  A(i)  be  the  set  of  all  nodes  of  forward  distance  i,  i.e. 
A(i)  =  (Nld^N)  =  il. 

Tli is  is  called  a  node  set. 

All  nodes  in  A(i)  can  be  scheduled  independently  of  each  other  since 
there  can  be  no  precendence  relations  between  nodes  in  A(i).   In  other  words, 
if  N  =>  N1  then  N  and  N'  cannot  be  processed  simultaneously. 

Now  we  have  the  following  lemma  which  characterizes  a  tight  graph. 

Lemma  1 : 

If  a  graph  G  is  tight,  then  for  every  node  N,  there  exists  N'e  SR(N) 
and  N"e  PR(n)  such  that  d  (N1  )  =  d-^N)  +  1  and  d^N")  =  ci^(N)  -  1.   (For  the 

terminal  node  SR(N  )  =  0  and  for  the  initial  node  PR(N  )  =  0.   Those  are 

exceptions. ) 

Proof: 

Obvious  by  Definition  2. 

(Q.E.D.) 

Corollary  1; 

Let  G  be  a  tight  graph.   Let  N  be  a  node  of  G.   Let  Ne  A(t).   Then 
for  every  i,  0  <  i  <  t  -  1,  there  is  at  least  one  node  N'e  A(i)  such  that  N'  => 
N.   Also  for  every  j,  t  +  1  <  j  <  h(G),  there  is  at  least  one  node  N"e  A(j) 
such  that  N  =>  N" . 
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Definition  k: 

To  p-schedule  a  set  Q  of  nodes  is  to  partition  Q,  into  subsets  of  size 
2  in  an  arbitrary  way  (if  |q|  is  odd,  there  will  be  one  subset  of  size  l)  and  to 
order  those  subsets  in  an  arbitrary  way. 

A  node  N  is  said  to  be  available  if  all  predecessors  of  N  have  been 
processed. 

7.3  Scheduling  of  a  Tight  Graph 

Having  these  definitions,  now  we  discuss  a  scheduling  of  a  tight  graph 
G  on  two  processors.   The  idea  of  this  scheduling  scheme  is  rather  simple.  We 
start  checking  |A(i)|  from  i  =  1  to  h(G).   For  each  i,  if  |A(i)|  is  even,  then 
we  p-schedule  A(i),  and  no  processor  time  will  be  wasted.   If  | A ( ± ) |  is  odd, 
and  if  we  simply  p-schedule  A(i),  then  there  will  be  one  node,  N,  left  which 
cannot  be  processed  in  parallel  with  another  node  in  A(i).   Thus  we  will  waste 
processor  time.   Therefore  a  node  which  can  be  processed  in  parallel  with  the 
above  left  out  node  N  must  be  found.   Where  can  that  node  be  found?  It  will  be 
shown  that  we  have  to  look  as  far  as  the  smallest  i'  larger  than  i  with  |A(i')| 
=  odd  to  find  it.   Thus  the  amount  of  work  to  look  ahead  is  always  bounded. 

Before  we  go  further,  a  few  more  definitions  are  in  order. 

t    n 
For  some  t  and  n,  let  us  consider  a  set  A  =  U  A(t+i).   Now  let  us 

n   i=0 
take  a  node  from  each  of  A(t+,j)  and  A(t+i)  (j  <  i).   Let  them  be  NJ  and  N1.   If 

N  £>   IT",  then  N  and  N  may  be  processed  simultaneously,  providing  that  all 

predecessors  of  N  and  N"  have  been  processed. 

Now  we  establish  this  relati  on  formally  on  A  . 
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Definition  5: 


,t  . 


The  p-line  relation  (  )  between  two  nodes  N  and  N'  in  A  is  defined 
n 


as  follows. 

E 

WW  if  and  only  if 

(1)   N  ^>  N'  and  N1  £>  N 
and  (2)   d  (n)  ^  d(W  ). 

We  also  write  (N,  W  )  for  WW.  A  pair  (N,W)  is  called  a  p-line  pair. 
Note  that  in  general  (N,N')  and  (W,N" )  do  not  necessarily  imply  (N,N"). 

Further  we  define  A^p)  -  {(N,N')  |  N€A(t+i),N'eA(t+j),0  <  i,j  <  n] , 

i.e.  A  (p)  is  a  set  of  all  pairs  of  nodes  in  A  between  which  the  p-line  relation 

n  f  *  n  * 

holds . 

-L.  J- 

Since  (N,N')  e  A  (p)  implies  that  (W>N)e  A  (p),  we  shall  in  general 
put  only  one  of  them  in  A  (p)  and  drop  the  other. 

An  algorithm  to  find  A  (p)  is  given  in  Section  7-5- 


Definition  6: 

A  p-line  set  L  on  A  is  an  ordered  set  of  p-line  pairs 

*■ n     n 

Ln  =  ((N0,N1'),(NrN2'),...,(Nk,N,k+1)) 


where 


(1)  NQ  e  A(t)  and  N'k+1  e  A(t+n), 


and  (2)   for  all  g(l  <  g  <  k),ci  (N  ')  =  cL  (N  )  but  N  '  +   N  . 
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t  t     t 

We  say  that  A  is  p-connectable  if  there  is  a  p-line  set  L  on  A  . 
n    * n     n 

t     '  t 

Also  we  write  L  (N~,  N  ,  ,)  when  the  first  and  the  last  nodes  in  L 
n  u   K+l  n 

are  of  special  interest. 

Further,  we  write  L^N-.N'   .  )  =  L^(Nn,N  ')  U  Lt+:J-(N  ,N'   )  if 

n  0'  k+l     1  0'  g      n-i  g'  k+l 

(N  ,,N*  )  and  (N  ,N')  are  two  adjacent  elements  of  L  (Nn,N'   )  and  d_(N' )  = 
g-i  g       g  g+i  n  u  K+i       l  g 

dj(N  )  =  t  +  i. 

An  algorithm  to  build  a  p-line  set  for  A  is  given  in  Section  7»5« 


Example  2 : 


A(l)   =    (b,c,d) 
A(2)   =    {e,f} 
A(3)  =   (g,h,i) 


Figure  f.k.      A  Graph  G 


1         d  1 

v     =     U  A(l+i)   =    {b,c,d,e,f,g,h,i}.      AQ(p)   =    ((b,f),(d,e),(f,h),(e,i)).      A 


1=0 
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typical  L  =  ((b,f),  (e,i)).   Hence  A  is  p-connectable. 

Further  we  define  a  special  p-line  set  called  a  p-line  (l)  set. 

Definition  7* 

Given  a  set  A  .  We  call  a  p-line  set 
n 

a  p-line  (l)  set  (write  L  (l))  if 

cL(N.  )  +  1  =  (L(N^+1)  for  all  i(0  <  i  <  k). 

Note  that  in  this  case  k  =  n  -  1. 

Also  we  write  L  (l)(N~,N,'  ,,)  when  the  first  and  last  nodes  are  of 
n     0'  k+ly 

particular  interest. 

Now  a  few  lemmas  are  in  order. 

Lemma  2: 

Suppose  N  e  A(t)  and  N'  e  A(t+n)  for  some  t  and  n  in  graph  G.  Assume 
that  (N,  N')  holds.   Then  there  is  a  p-line  (l)  set 

I>(1)(N,N')  =  ((N0,N1'),(N1,N2'),...,(Nn_1,Nn')) 

where  N^  =  N  and  N  '  =  N1 ■ 
0         n 

Proof: 

A  proof  is  given  by  an  induction  on  n. 

First  note  that  |A(t+i)|  >  2  for  all  i,  1  <  i  <  n  -  1.   Otherwise 
N  $*  N'  holds  and  (N,N')  does  not. 
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(1)  Now  let  n  =  2.  Then  there  must  be  two  distinct  nodes 

N  ,Np  e  A(t+1)  such  that  N  -  N,  and  N  -*   N' .   Otherwise  the  graph 

G  is  not  tight.  Hence  (N^N1)  and  (N,N2).   Thus  1^(1)  =  ((N,Ng), 

(NrN')). 

(2)  Now  assume  that  the  lemma  holds  for  n  <  i. 

Let  n  =  i  +  1.   Let  N  e  A(t),  N'e  A(t+i+l),  and  (N,N')«   Then 
there  must  be  two  distinct  nodes  N  ,N  e  A(t+i)  such  that  (N,N..) 

and  N  $>  Np.   This  follows  from  Lemma  1  and  Corollary  1.   Then 

(Np,N*)>  since  otherwise  N  =>  N'  holds  and  this  contradicts  the 

assumption.  By  the  induction  hypothesis,  there  is  a 

L^_1(1)(N,N1)  =  ((N,irL'),(N1,^,),...,(Mn"2,N1)). 
Thus  there  is  a  p-line  (l)  set 

L*(1)(F,H')  =  L^_1(1)(N,N1)  U  (N2,N«)  =  ( (N,  N1'  ),  (n\  N2'  ), 

...,(Mn'2,N1),(N2,N')). 

(Q.E.D.) 


Lemma  3: 


Suppose  that  N  e  A(t),N1,N2£  A(t+i),N^,N  e  A(t+j),  and  N1  e  A(t+n) 
where  i  <  j  <  n.  Also  assume  that  (N,N  ),(N  ,N*),  and  (n  ,N')  hold. 

Then  there  is  a  p-line  (l)  set  L^(l)(N,N')  =  ((N,^1 ),...,  (N  ^N')). 


Proof: 
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Figure  7-5*  An  Illustration  for  Lemma  3 

Since  (N, N  )  by  the  previous  lemma,  there  is  a  p-line  (1)  set 
L^(1)(N,N3)  =  Lj(l)(N,N)  U  L^(1)(N',N3), 

u  -*-  u 

and  since  (rr,N'),  there  is  a  p-line  (l)  set 

L^aKw^N')  =  L^CDCN2^)  U  Lt+^(1)(NSN'). 
n-i      '       j-iv     '      n-j    v  ' 


By  definition,  N  ^  N' ,  N  /  N' ,  W  £   N  and  N1  ^  N2. 
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Now  we  have  two  cases. 


(1)  N5  =  N' 

Then  N  ^  N1.   Now  let 


l*(i)(n,n')  =  l*(i)(n,n)  u  lJ^CDCh1,/)  U  L^  (l) ft. ,r ) . 

(2)   N5  ^  N'. 
Then  let 

L*(i)(H,r)  =  lJ(i)(n,n3)  U  L^(l)(N',N'). 

Thus  in  either  case  there  is  a  p-line  (l)  set  on  A  . 

n 

(Q.E.D. ) 

From  Lemmas  2  and  3>  we  can  prove  the  following  lemma. 

Lemma  k: 

If  A  is  p-connectable,  then  there  is  a  p-line  (l)  set  L  (l)(N,  N') 

where  N  e  A(t)  and  N1  e  A(t+n). 

What  Lemma  h   implies  is  the  following. 

Let  L*(l)(N,N')  =  ((N,N1'),(N1,N2'),(N2,N3'),...,(Nn_1,N')),  i.e. 

dI(N.')  =  cL^N.)  =  t  +  i  and  cL^N.)  +  1  =  ^(N^). 

Now  assume  that  a  set  A  is  p-connectable.   Then  we  have  L  (l)(N,N')  = 

n  n    ' 

((N,^'),  (NrN2'),  ...,(Nn_1,N'))  where  N  e  A(t)  and  N'  e  A(t+n).   Since  (N.,N'), 
we  can  process  both  at  the  same  time.   To  do  this  we  first  process  A(t+i)  -  {N.} 
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(notice  that  djC^)  =  t+i).   Then  process  (N.,N'   }.   Finally  process  A(t+i+l)  - 
(N!  ,}.  This  leads,  us  to  the  follovring  scheduling- 


Definition  8: 


Assume  A     is  p-connectable.      Then  by  to  p-line  schedule  A    bv  I  (l). 
n  *  n  n       ' 

we  mean  the   following  scheduling. 

Let  L*(l)   =    ((N0,N1'),(N1,N2,),...,(Nn-1,Nn')). 

(1)  p-schedule  A(t)   -   {NQ}  . 

(2)  g  =  1 

(3)  p-schedule   {N  _   ,N   '} 

o  o 

(h)     p-schedule  A(t+g)  -  {N  ',N  ]. 

D       g 

(5)  g  =  g  +  1.   If  g  <  n,  then  go  to  (3). 

(6)  p-schedule  (N  ,,N  '}  . 
v  '     *  n-1'  n 

(7)  p-schedule  A(t+k)  -  (N  '}• 

Now  an  algorithm  to  schedule  a  tight  graph  for  two  processors  is 
described. 

Scheduling  is  done  according  to  node  sets  A(i),  for  i  =  1,2,  . .  .,h(G). 

All  nodes  in  A(i)  can  be  processed  independently  of  each  other,  i.e., 
in  any  order. 

If  |A(i)|  is  even,  then  two  processors  can  be  kept  busy  all  the  time 
to  process  A(i)  and  A(i)  can  be  processed  in  time  |A(i)|/2. 

If  |A(i)|  is  odd,  then  there  is  a  possibility  that  a  machine  becomes 
idle,  i.e.,  one  node  will  be  left  out  from  A(i)  which  does  not  have  any  partner 
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to  be  processed  with.   Let  it  be  N.   Then  a  partner  must  be  found  from  some 
A(j),  j  >  i.   First  we  may  try  to  find  N'  €  A(i+l)  which  can  be  a  partner  of  N. 
However  if  |A(i+l)|  is  even,  then  |A(i+l)  -  (N'} |  becomes  odd  and  we  have  the 
same  problem  again,  and  may  try  to  find  a  node  from  A(i+2)  to  fill  an  idle 
machine,  and  so  on.   If  this  cycle  is  ever  to  stop,  it  must  stop  when  A(i+k) 
is  hit  where  |A(i+k)|  is  odd.  Otherwise  there  is  no  way  to  remedy  the  cycle, 
and  machine  time  must  be  wasted. 

Tight  Graph  Scheduling  Algorithm: 
Step  1:  t  =  0 

Step  2:  If  t  =  h(G)  then  p-schedule  A(t)  and  stop,  else  go  to  Step  3* 
Step  3:  If  |A(t)|  is  even,  then 
(3-1)  p-schedule  A(t)  and 
(3-2)  go  to  Step  7- 
Step  k:      |A(t) |  is  odd. 

Find  A(t+1). 
Step  5:  If  VN  e  A(t)  |SR(N)|  =  |A(t+l)|,  then 
(5-1)  p-schedule  A(t)  and 
(5-2)  go  to  Step  7. 
Step  6:  There  is  N'  e  A(t)  such  that  |SR(N')|  <  |A(t+l)|. 
(6-1)  If  |A(t+l)|  is  odd,  then 

(6-1-1)  p-schedule  A(t)  -  (N']- 

(6-1-2)  p-schedule  (N*}  U  {N"l  where  N"  e  A(t+l)  -  SR(N'). 

(6-1-3)  p-schedule  A(t+1)  -  (N") . 

(6-1-U)  go  to  Step  7- 
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(6-2)   |A(t+l)|  is  even. 

(6-2-1)  Find  out  the  smallest  k  greater  than  1  such  that  |A(t+k)| 

is  odd. 
(6-2-2)  If  there  is  no  such  k  (i.e.,  we  have  checked  up  to  A(h(G))) 

then  p-schedule  A(i)  t  <  i  <  h(G)  individually,  and  stop. 
(6-2-3)  Else  we  have  a  set  (2=   (A(t),A(t+l), . .  .,A(t+k)}  where  |A(t)| 

and  |A(t+k)|  are  odd  and  other  |A(t+i)|  are  all  even.  Check 

p-connectivity  of  A,  . 

(1)  A,  is  not  p-connectable. 

(l-i)  p-schedule  A(t),A(t+l), .. .,A(t+k-l)  individually. 
(1-ii)  Let  t  =  t  +  k  -  1. 
(l-iii)  go  to  Step  "J. 

(2)  A,  is  p-connectable. 

(2-i)  Find  a  p-line  (l)  set  l£(l)  =  ( (NQ,  N^  ),  (N^N^ ),  . . ., 
(Nk+1,Nk')). 

(2-ii)  p-line  schedule  A,  by  L.  (l). 

(2-iii)  t  =  t  +  k. 
(2-iv)  go  to  Step  7- 
Step  7:   t  =  t  +  1. 

Go  to  Step  2. 

Example  3 : 

(l)   |A(l)|  and  |A(3)|  are  odds,  and  |A(2)  |  is  even.  Thus  we  have  &= 
{A(1),A(2),A(3)1  -   (by  Step  6  of  Algorithm) 
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A(l)  =  {a,b,c,d,e} 
A(2)  =  (f,g) 
A(3)  =  (h,i,J). 


Figure  7-6.  An  Example  of  a  Tight  Graph  Scheduling 


(2)  For  A^,  we  have  Lg(l)  =  ( (d,  f),  (g,h)). 

(3)  According  to  Step   (6-2-3) (2),   we  schedule  as   follows. 

(i)     p-schedule  A(l)   -   {d}    =   (a,b,  c,  e). 
(ii)     p-schedule   { d, f } . 
(iii)     p-schedule  A(2)   -   (f,g)    =  0. 
(iv)     p-schedule   (g,h)  . 
(v)     p-schedule  A(3)   -   (h)    =   {i,j}. 
Thus  we  have  an  optimum  schedule : 
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Step      12     3^5 

machine  A 


B 


a     c    '  d 


g     i 


b     e     f 


h     J 


Proof  for  the  algorithm: 

Lemma  5' 

Step  3  is  optimum  and  "whatever  p-schedule  is  made,  it  does  not  affect 
the  later  stages. 

Lemma  6: 

Step  5  is  optimum  and  whatever  p-schedule  is  made,  it  does  not  affect 
the  later  stages. 

Proof: 

First  note  that  after  A(t)  has  been  processed,  nodes  in  A(t+l)  only- 
can  become  available.   Since  VN  e  A(t), |S(N)|  =  |A(t+l)|,  all  nodes  in  A(t) 
must  have  been  processed  before  any  node  in  A(t+l)  can  be  processed. 

(Q.E.D.) 


Lemma  7 • 

Step  6-1  is  optimum  and  whatever  p-schedule  is  made,  it  does  not 
effect  the  later  stages. 

Proof: 

The  algorithm  actually  processes  A(t)  and  A(t+l)  (where  |A(t) |  and 
|A(t+l)|  are  odd)  in  time  ( |A(t)  |  +  |A(t+l) | )/2,  which  is  optimum. 

(Q.E.D.) 
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Lemma  8 : 

Step  6-2-2  is  optimum. 

Proof: 

Lettf=  (A(t),A(t+l),  ...,A(h(G))).   Since  |A(t)|  is  odd  and  all  other 


|A(i)|  is  even  (t  <  i  <  h(G)),  it  takes  at  least  time 
I"  lA(t)K!A(t+l)l+...^|A(h(G))l  -| 


to  process  cl. 

Step  6-2-2  achieves  this. 

(Q.E.D.) 

Lemma  9: 

Assume  |A(t)|  and  |A(t+k) |  are  odd  and  |A(t+i)|  are  all  even  (l  <  i  < 
k).  If  A,  is  p-connectable,  then  Step  6-2-3  (2)  is  optimum. 

Proof: 

Oc=  {A(t),A(t+l),  . .  .,A(t+k)}  can  be  processed  in  time 

|A(t)|+|A(t+l)|+...+|A(t+k)| 
2 

Step  6-2-3   (2)  achieves  this. 

(Q.E.D.) 

Lemma  10: 

Assume  |A(t)|  and  |A(t+k) |  are  odd  and  |A(t+i)|  are  all  even  (l  <  i  < 

k).  If  A,  is  not  p-connectable,  then  Step  6-2-3  (l)  is  optimum. 
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Proof  : 

Let£^=  (A(t),A(t+l),  . .  .  ,A(t+k)}  .  Since  A^  is  not  p-connectable, 

there  is  no  p-line  set  1^(1)-  Thus  there  will  be  N  in  some  A(t+i)  (0  <  i  <  k) 

which  does  not  have  any  partner  to  be  processed  with  it,  thus  making  a  machine 
idle.  Now  if  this  situation  could  be  remedied,  then  it  would  be  only  because 
there  is  N'  e  A(t+n+j)  (j  >  0)  which  can  be  done  in  parallel  with  N. 


A(t+n) 


A(t+n+j) 


odd 


even 


odd 


Figure  T«T-  An  Illustration  for  Lemma  11 


Suppose  that  the  above  could  be  done.   The  parallel  processing  of  N 


t  , 


and  N1  can,  however,  be  advantageous  only  if  A.  is  p-eonnected  and  there  is 
a  p-line  (l)  set  L^(l)(N,Nf )  where  N  e  A(t),N'  e  A(t+i)  and  N'  4   N-   Otherwise 
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processors  cannot  be  kept  busy  for  A(t),A(t+l),  . . . ,  A(t+i)-{N  )  and  we  do  not 
gain  at  all  by  delaying  a  processing  of  N.   Now  from  the  assumption,  N  £>  N" 

for  all  N"  €  A(t+n)  because  otherwise  A  becomes  p-connectable.  Also  by 

n  " 

Corollary  1,  there  is  a  node  N  in  A(t+n)  such  that  N  >  N' .  This  implies  that 
N  5>  N'.  Thus  N  cannot  be  processed  in  parallel  with  N.   This  proves  the 

lemma. 

(Q.E.D.) 

Essentially  what  the  above  algorithm  does  is  upon  finding  A(t)  where 
|A(t)|  is  odd  to  try  to  delay  the  processing  of  a  node  N  in  A(t)  till  the  next 
node  set  A(t+l).   If  |A(t+l)|  is  even,  then  again  the  algorithm  tries  to  delay 
the  processing  of  a  node  N'  in  A(t+l)  till  the  next  level  A(t+2),  etc.  until  it 
finds  some  A(t+k)  where  |A(t+k) |  is  odd,  or  |A(t+k)  U  (N"} |  is  even  where  N" 
is  a  node  whose  processing  is  being  delayed  from  the  previous  stage.   In  other 
words,  the  algorithm  tries  to  establish  p-connectability  between  two  adjacent 
node  sets  both  of  which  have  an  odd  number  of  elements.  Note  that  it  is  not 
necessary  to  try  to  establish  p-connectability  among  more  than  two  odd  node 

sets,  say  A(t),A(t+k)  and  A(t+m)(m  >  k),  because  A  cannot  be  p-connectable  if 

A,  is  not  (see  Lemma  10). 

Now  the  above  argument  together  with  Lemmas  5-10  prove  the  following 
theorem. 

Theorem  1: 

The  algorithm  gives  an  optimum  schedule  of  a  tight  graph. 
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1  .k     Scheduling  of  a  Loose  Graph 

Now  let  us  extend  the  above  algorithm  so  that  it  can  handle  a  loose 
graph.  To  facilitate  presentation,  a  few  definitions  are  in  order. 

(From  now  on  by  "a  graph",  a  "loose  graph"  is  to  be  understood. ) 

Definition  9; 

A  node  N  in  a  graph  G  for  which  d_(N)+dT(N)  ^  h(G)  is  called  a  loose 

node.   Otherwise  a  node  is  tight.   Let  N  be  a  loose  node.   Then  we  define  the 
far  distance  d_(N)  as  <L(N)  =  h(G)  -  d^N). 

A  set  B(i)  is  a  node  set  (cf.  Definition 
k)   such  that 

B(i)  =  {N|(N  is  a  tight  node)  and 
(d^N)  =  i)}  U  {N|(N  is  a 

loose  node)  and  (cL(N)  =  i)}  . 

We  write  Bt(i)  and  Bi(i)  for  the  above 
two  component  sets  respectively,  i.e. 

B(i)  =  Bt(i)  U  B*(i). 

Note  that  a  loose  node  N  can  be 
processed  anywhere  between  cL(N)  and 


di(V 


d  (N)  without  affecting  the  rest  of  a 


Figure  7-8.   A  Loose  Node 


graph.   Definition  9  says  that  loose 
nodes  are  pushed  far  down  and 

classified  in  terms  of  the  far  distance  rather  than  the  forward  distance. 

Note  that  a  loose  node  N  receives  two  attributes,  the  far  distance 

d  (N)  and  the  forward  distance  d  (N),  and  is  classified  in  terms  of  cL(N), 
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e.g.  we  say  that  a  loose  node  N  with  the  forward  distance  d  (N)  is  in  B(i) 
where  i  =  cL.(N). 

We  also  define  a  set  C(i). 

Definition  10: 

A  set  C(i)  of  loose  nodes  is  defined  as  follows. 
C(i)  =  (N  |  N  is  a  loose  node  and  <L(n)  <  i  <  d_(N)}  . 

C(i)  is  a  set  of  those  loose  nodes  which  may  be  processed  in  parallel 
with  a  node  in  B(i).   If  cL.(N)  =  i,  then  N  is  put  in  B(i)  rather  than  in  C(i). 

C(i)  is  a  set  of  loose  nodes  which  will  be  used  to  fill  up  waste  processor  time 
if  necessary. 

Scheduling  a  loose  graph  is  similar  to  that  for  a  tight  graph. 
A  loose  graph  is  scheduled  in  accordance  with  B(i)  for  i=l, 2, . . .,h(G) .  A  loose 
node  N,  even  though  it  is  in  B(cL(N)),  may  be  scheduled  with  any  B(k)  where 

d_(N)  <  k  <  d  (N).  All  tight  nodes  are  scheduled  first  and  loose  nodes  are 

scheduled  as  late  as  possible.   If  it  becomes  inevitable  to  waste  processor 
time  if  only  tight  nodes  are  used,  then  loose  nodes  are  used  to  fill  those 
otherwise  wasted  times. 

In  what  follows  B(i)  plays  a  similar  role  to  that  of  A(i)  in  the 
previous  discussion.  All  definitions  for  A(i)  are  also  applicable  to  B(i)  with 
a  few  modifications. 

Now  the  p-line  relation  (==)  between  two  nodes  N  and  N'  in  B  is  re- 

n 

defined  as  follows. 
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Definition  5' : 

t    £ 
Let  N  and  N'  be  two  nodes  in  B  .  NTT  ((n,N' ))  holds  if  either  one  of: 

(l)  (i)  N1  is  not  a  loose  node  (N  may  be  a  loose  node), 
and  (ii)  (^(N)  (or  d^N)  if  N  is  a  loose  node)  +  1  =  cL(N')^ 

(In  other  words,  N  e  B(t+k)  and  N'  e  Bt(t+k+l),  0  <  k  <  n. ) 
and  (iii)  N  /  N'. 
or  (2)   (i)  N*  is  a  loose  node  (N  may  be  a  tight  node), 

and  (ii)   MN'  )  <  d_(N)  (or  d_(N)  if  N  is  a  loose  node)  <  d_(ir) 

(In  other  words,  N*  €  Bl(t+k)  and  N  e  B(t+j)  where  <L. (N1  )  < 

t+j  <  t+k  <  t+n.  )  If  (L-(N')  <  t,  then  the  above  inequality 

becomes 

t  <  d^(N)  (or  IjW)  <   d^N1  ) 

holds . 

Then  L  on  B  is  defined  similarly  to  Definition  6. 
n     n 

Example  k    (see  Figure  7»9)- 

(2,5),  (3,*0,(^,8),(5,7)  e  Bn(p)  by  (!)'  Since  <W  =  t  +  1  and  dJ(N6)  = 
t  +  3,  (2,6),'(5,6),  (1^,6),  (5,6)  e  B*(p)  by  (2). 
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B(t)   =    (N^ 


B(t+1)   =   {N2,N31 


B(t+2)   =   {NVN51 


V  B(t+3)   =    (N^N^Ngl 

Bt(t+3)   =    {N7,Ng),Bl(t+3)   =    {N6} 


B(t44)    =    {N  } 


Figure  7.9.  The  p-line  Relation  in  B. 


n 


(1)  of  the  above  definition  resembles  Definition  5  and  takes  care  of 
tight  nodes.   As  Lemmas  2,  3  and  h   showed  Definition  5  is  more  general  than 
necessary.   That  is,  the  p-line  relation  need  only  to  be  defined  between 
adjacent  levels,  i.e.  A(i)  and  A(i+l).  And  Definition  5'  follows  this 
simplification. 

(2)  ,  on  the  other  hand,  takes  care  of  loose  nodes.   If  a  loose  node 
N  is  in  B(i)  then  this  means  that  d. (N)  =  i  and  N  may  be  processed  in  parallel 

with  any  node  N'  e  B(j)  where  d_(N)  <  j  <  i.   Because  of  this  addition,  Lemmas 

2,  3  and  k   do  not  hold  anymore.   For  example  let  us  consider  the  graph  G: 
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©   B(t+1) 


It  is  easy  to  see  that  (N  ,N  )  e  B,(p)  by  (2)  of  Definition  5'.  Since  (N  ,N  ) 

/  B,(p),  there  is  no  p-line  (l)  set  on  B,(,p).   This,  however,  does  not  bother 

us.   Assume  that  B  is  p-connectable,  and  let  L  =  ( (tL.N  ' ),  (Nn  ,  N  ' ), . . ., 
n  nO'11'2'' 

(N.,N.   '),...,  (N,,,N' )),  where  N  €  B(t)  and  N  '  e  B(t+n).   Further  assume 

that  N.  e  B(t+a. )  and  N.   *  e  B(t+a.  ,).   If  a.  +  1  /  a.  .,  then  by  Definition 
i       i       l+l        l+l       i     '      l+l7       J 

5'  N.   '  is  a  loose  node  and  d_(N.   ')  <  t  +  a. .   Thus  N.   '  can  be  processed 
l+l  I  l+l   —      l        l+l 

in  parallel  with  N.  without  affecting  the  rest  of  a  graph,  i.e.  we  first  process 

B(t+a.  )  -  fN.)  and  then  process  (N..N.   '].   If  a.  +  1  =  a.  ,,  then  the  old 
v   iy     l  *  i'  l+l        i       i+l' 

strategy  can  be  applied,  i.e.  we  first  process  B(t+a. )  -  {N.},  then  {N.,N.   '}, 

then  B(t+a.  .)  -  {N.  n '} . 
l+l      l+l  ' 

This  leads  us  to  modify  Definition  8  as  follows. 
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Definition  8': 


Given  B*   and  L*,    where   L^  =    ((N^N^  ),  (N^M,  •  • -,  (N^N^' ), . . ., 
(N,    ,, N.  '))•     Then  by  to  p-line  schedule  B    by  L  ,   we  mean  the  following 

scheduling. 

(1)  p-schedule  B(t)   -   {NQ}  . 

(2)  g  =  1. 

(3)  p-schedule   (N       ,N   '}  • 

g-J-        6 

(U)  Let  N  ,  e  B(t+a  .  )  and  N  '  e  B(t+a  ). 
g-1       g-1       g        g 

(i)  If  a  .  +1  =  a  ,  then  p-schedule  B(t+a  )  -  {N  '  N  }. 
g-1       g  g      g  '  g 

(ii)  If  a  ,  +  1  4   a  >  then  p-schedule  B(t+b)  where  b  =  a  ,  + 
g-1     T     g  g-1 

1,  a  ,  +  2,  ....  a  -  1.   Then  p-schedule  B(t+a  )  - 
g-1  g  g 

{N  f,N  }  . 

g   g 

(5)  g  =  g  +  1.   If  g  <  k,  then  go  to  (3). 

(6)  p-schedule  (Nj^N'}. 

(7)  p-schedule  B(t+k)  -  [N.'}- 

Finally  in  connection  with  Definition  8',  we  define  the  following. 

t  t     t 

Suppose  that  B  is  not  p-connectable,  i.e.  there  is  no  p-line  set  L  on  B  . 
*  n       *  n     n 

It  is,  however,  possible  to  find  a  p-line  set  L  on  B  for  some  s,  0  <  s  <  n. 

Definition  11: 

t  t 

Given  B  which  is  not  p-connectable.   Now  we  check  if  B  is  p- 
n  s 

connectable  for  s  =  1,2, . ..,n-l.   Let  m  be  the  smallest  s  such  that  B  is  p- 

S 
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connectable  but  B  .  is  not.  We  call  m  the  maximum  p-connectable  distance, 
s+l  ■     *  ' 


and  L  a  maximum  p-line  set. 

m   *      


The  following  example  illustrates  the  above  definition. 


Example  5; 


B(t+1) 


Figure  7*10.  An  Example  of  the  Maximum  p-connectable  Distance 


Let  us  consider  p-connectability  of  B^.   It  is  clear  that  B^,(p)  = 
{ (N,,N,  ),  (N,  ,N/-),  (N  ,  No)} ,  and  B,  is  not  p-connectable.   Since  B,  is  p- 
connectable  (N  ,  N.  )  but  B  is  not,  the  maximum  p-connectable  distance  in  the 
above  B,  is  1  and  L  =  ((N  ,N,  )). 


221 


Using  a  similar  technique  to  the  one  described  in  Section  7«5>  we  can 
check  p-connectability  of  B  . 

An  algorithm  to  schedule  a  loose  graph  for  two  processors  is  now 
given.   The  algorithm  resembles  the  one  for  a  tight  graph.  The  major  difference 
lies  in  the  treatment  of  loose  nodes.  Loose  nodes  are  scheduled  as  late  as 
possible.   They  will  be  used  when  it  becomes  inevitable  to  waste  processor 
time. 

In  what  follows  we  modify  the  definitions  for  B(i)  and  C(i)  so  that 
they  do  not  include  those  loose  nodes  which  hare  been  already  scheduled. 

Loose  Graph  Scheduling  Algorithm 

Step  1:  t  =  0. 

Step  2:  If  t  =  h(G)  then  p-schedule  all  unscheduled  nodes  and  stop,  else  go 

to  Step  3» 
Step  3:  If  |B(t)|  is  even,  then 
(3-1)  p-schedule  B(t) 
(3-2)  go  to  Step  7- 
Step  k:      |B(t)|  is  odd. 

Find  B(t+1). 
Step  5:   If  V  N  e  B(t)  |SR(N) |= |B(t+l) |, 
then 
(5-1)  Check  C(t).   If  C(t)  =  0,  then  p-schedule  B(t)  and  go  to  Step  7. 
(5-2)   Otherwise  pick  N  with  the  minimum  cL(n)  in  C(t).   (if  there  are 

more  than  one  such  N,  pick  any  N. )  Now  p-schedule  B(t)  U  {N}  and 
go  to  Step  7* 


Step  6:      There  is  N'    e   B(t)   such  that    |SR(N')|   <   |B(t+l)|. 
(6-1)     If    |B(t+l)|    is   odd,    then 

(6-1-1)     p-schedule  B(t)  -   {N1}. 

(6-1-2)  p-schedule  (N')  U  (N")  where  N"  e  B(t+l)  -  SR(N'). 
(6-1-3)  p-schedule  B(t+l)  -  {N"}  . 
(6-1-U)  go  to  Step  7- 
(6-2)   |B(t+l)|  is  even. 

(6-2-1)  Find  out  the  smallest  k  greater  than  1  such  that  |B(t+k)|  is 

odd. 
(6-2-2)  If  there  is  no  such  k  (i.e.  we  have  checked  up  to  B(h(G))), 
then  p-schedule  B(i)  (t  <  i  <  h(G))  individually  and  stop. 
(6-2-3)  Else  we  have  a  set  &  =    (B(t),B(t+l), . . .,B(t+k)}  where  |B(t)| 
and  |B(t+k)|  are  odd  and  other  |B(t+i)|  are  all  even.  Check 

if  B,  is  p-connectable. 
k 

(l)  B,  is  not  p-connectable. 

(l-i)  Find  the  maximum  p-connectahle  distance  in  B,  .   Let 

K. 

it  be  m. 

(1-ii)  Check  C(t+m).   If  C(t+m)  =  (/>,    then  p-schedule  B(t), 

B(t+l),...,B(t+k-l)  individually.   Let  t  =  t+k-1. 

Go  to  Step  7- 

(1-iii)      Otherwise  let  B' (t+m)   =  B(t+m)   U   {N}    where  N  e   C(t-fm) 

t       m_1 
and  has  the  minimum  d-CN).      Let  B'      =     U  B(t+i)   U 


B'(t+m).      Then  p-line   schedule  B'     by  a  maximum 

'  r  m 
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p-line  set  L  .   Let  t  =  t  +  m.   Go  to  Step  7* 

t 
(2)  B^  is  p-connectable. 

(2-i)  Find  a  p-line  set  L.  . 

(2-ii)  p-line  schedule  B  by  L  . 

(2-iii)  t  =  t  +  k. 
(2-iv)  Go  to  Step  7. 
Step  7:   t  =  t  +  1. 

Go  to  Step  2. 

Proof  for  the  Algorithm: 

That  the  above  algorithm  is  optimum  can  be  proved  by  a  similar 
argument  used  to  prove  the  previous  algorithm.   For  example,  we  can  show  that 
Steps  3,5-1,6-1,6-2-2  and  6-2-3(2)  are  optimum  easily  by  previous  Lemmas.   Now 
we  have  to  show  that  Steps  5-2  and  6-2-3 (l)  are  optimum. 

Lemma  11: 

Step  5-2  is  optimum. 

Proof: 

Suppose  that  we  do  not  use  N  in  C(t)  to  fill  up  otherwise  wasted 
processor  time,  where  N  is  the  node  with  the  minimum  cL(n)  in  C(t).   Saving  N 

for  later  use,  however,  does  not  improve  the  situation  because  a  node  N  cannot 
be  used  more  effectively  than  to  fill  up  wasted  processor  time  anyway. 

Also  the  choice  of  N  from  C(t)  is  optimum.   Suppose  for  example  that 
N'  with  cL(N')  >  cL(N)  is  also  in  C(t).   Now  let  us  consider  those  two  nodes 
(See  Figure  7-12).  Whether  we  use  N  or  N'  to  schedule  with  B(t)  will  not  make 
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any  difference  to  schedule  B(t+l),B(t+2), . .  ,,B(d_(N))  because  a  node  N  or  N' 

is  available  if  necessary.   Suppose  we  used  N1  with  B(t).   Then  it  is  not 
possible  to  fill  a  later  request  -which  may  arise  when  B(u)  (d_(N)  +  1  <  u  < 

dL(N' ))  is  scheduled,  whereas  if  we  used  N  with  B(t)  then  we  can  fill  the 

request.   Thus  this  proves  the  lemma. 


djCN') 


Figure  7.11.     An  Illustration  for  Lemma  13 


(Q.E.D. ) 


That  Step  6-2-3 (l)  is  optimum  is  proved  similarly. 
Now  we  have  the  following  theorem. 


Theorem  2 


The  algorithm  gives  an  optimum  schedule  of  a  loose  graph. 
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7 • 5  Supplement 

(1)  An  algorithm  to  establish  A  (p)  on  A  is  now  discussed, 
v       *©  n       n 

n 
Let  m  =  L   |A(t+i) \,   and  B  be  an  m  x  m  connection  matrix  where  the 
i=0 

first  |A(t)|  columns  and  rows  are  labeled  by  nodes  in  A(t),  the  second  |A(t+l) | 

columns  and  rows  by  nodes  in  A(t+l),  and  so  forth.  An  element  b.  .  of  B  is  1  if 

and  only  if  N.  -*  N .  where  N.  and  N.  are  labels  for  the  i-th  row  and  the  j-th 

column. 

Now  define  the  multiplication  of  the  connection  matrices  as  follows. 

Let  A  =  B  x  C  where  A,  B  and  C  are  m  x  m  connection  matricies.  Then  a.  .  = 

ij 

m  -     m  ,      -   — 

V  (b   *  C   ).   Now  complete  Bm  =  V  B  .   In  Bm,  b?.  =  1  implies  that  N.  >N. 
k=l  lK    *J  k=l  1J  x    J 

and  b.  .  =0  implies  that  (N.,N.)  •   For  example,  let  us  consider  the  graph  in 
J- j  -J-  J 

Figure  7-13»  Then  we  have 

A^  =  { (a,f),  (a,g),  (b,e),  (b,h),  (b,g),  (c,d),  (d,h),  (d,j),  (f,h),  (g,h), 
(g,i)). 

(2)  Given  A  (p)  for  A  ,  an  algorithm  to  find  p-connectivity  and  L  (l)  is 
described. 

According  to  Lemma  k,    if  A  is  p-connectable,  then  there  is  a 

p-line  (l)  set  L  (l).   Thus  to  check  p-connectability  it  is  enough  to  examine 

if  there  is  a  p-line  (l)  set  L  (l). 

n 


226 


(a) 


a 

b 

c 

d 

e 

f 

g 

h 

i 

J 

a 

1 

1 

b 

1 

1 

c 

1 

1 

l 

d 

1 

e 

1 

1 

1 

f 

1 

1 

g 

1 

h 

i 

•i 

a 

b 

c 

d 

e 

f 

e 

,1 

a 

1 

1 

i 

i 

b 

1 

1 

i 

c 

1 

1 

1 

1 

i 

d 

i 

e 

i 

f 

• 

i 

g 

i 

h 

\ 

K — 

i 

\ 

. 

J 

\ 

(b)  B 


(c)  Bc 


t/t, 


Figure  7.12.  An  Example  for  A  (  ) 
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First  let 

A*(p)(l)  =  a£(P)  -  C(N,Nf)|  I^CN)  -  d^N')!  >  1} 

— t 
Now  we  construct  a  graph  A  as  follows. 

n 

(1)  A  has  following  nodes: 

(i)  an  initial  node  N  , 

s 

(ii)  a  terminal  node  N_, 

.   .  t 

(iii)  all  nodes  in  A  , 

(iv)   for  each  node  N  (N  €  A  ,  N  /  A(t),  N  /  A(t+n))  a  new 

duplicate  node  N' . 

— t 

(2)  A  has  following  verticies: 

n 

(i)  a  vertex  from  N  to  every  node  N  which  was  in  A(t), 

(ii)  a  vertex  from  every  node  N  which  was  in  A(t+n)  to  N„, 

iii 

(iii)  if  (N^Ng)  e  A*(p)(l),  then  a  vertex  from  V     to  N  . 

(iv)   for  every  A(t+k),l  <  k  <  n-1,  for  every  N  e  A(t+k),  a  vertex 
from  N  to  every  N'  which  is  a  duplicate  of  N"  e  A(t+k)  but 
is  not  identical  to  N. 
To  illustrate  the  above  definition,  let  us  consider  the  following 

example . 


Example : 


Let  A2  =  {a,b,c,d,e,  f,g} 
where  a,  b  e  A(t) 
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c,  d,  e  e  A(t+l) 
and  f,g  €  A(t+2). 
Further  let 

a£(P)(1)  =  {(b,c),(d,f),(d,g),(c,f)} 


Figure  "J. 13-     An  Example  for  p-connectivity  Discovery 


Then  A  has  nodes  =  {N  ,  N_,  a,  b.c,  d,  e,  f,  g  (which  are  nodes  in  A  ),  c*,  d',e' 
(which  are  duplicates  of  nodes  in  A(l))}  and  verticies  =  { (N  ,a),(N  ,b),  (f,N_), 

(g^N-,),  (b,c),  (d,f ),  (d,g),  (c,f  )(which  are  original  verticies  in  A  ),  (c,d!  ),  (c,  e' ), 

hj  n 

(d,  c' ),  (d,  e' ),  (e,c' ),  (e,  d' ) (which  are  verticies  from  N  to  a  duplicate  N'  of  N" 
which  is  not  identical  to  N).] 

Now  it  is  clear  that  A  has  a  p-line  (l)  set  L  (l)  if  and  only  if 

n  n 

there  is  a  path  from  N  to  N  in  A  .   There  is  a  well-known  algorithm  for  path 

finding,  e.g.  f 19] . 
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Figure  7-13.      (continued) 
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8.   CONCLUSION 

This  thesis  introduced  new  techniques  to  expose  hidden  parallelism 
in  a  program.  Techniques  included  the  use  of  one  of  the  fundamental  arithmetic 
laws,  i.e.,  the  distributive  law,  extensively.  Furthermore  it  was  suggested 
that  with  the  help  of  these  techniques  computation  of  a  program  might  be 
speeded  up  logarithmically  in  the  sense  that  computation  time  became  a  log- 
arithmic function  of  the  number  of  single  variable  occurrences  in  a  program 
rather  than  its  linear  function.   Even  though  discussions  were  based  on  an 
ILLIAC  IV  type  machine,  as  mentioned  before,  they  are  readily  applicable  to 
pipe-line  machines  such  as  CDC  STAR. 

Chapter  2  of  the  thesis  studied  parallel  computation  of  summations, 
powers  and  polynomials.  The  minimum  time  to  evaluate  summations  or  powers  as 
well  as  the  minimum  number  of  PE's  required  to  attain  it  was  given.  A  scheme 
which  computed  a  polynomial  in  parallel  in  lesser  time  than  any  known  scheme 
was  also  introduced.   Because  of  its  simplicity  in  scheduling,  the  k-th  order 
Horner's  rule  for  parallel  polynomial  computation  was  studied  in  detail.   It 
was  shown  that  for  this  algorithm  the  availability  of  more  PE's  sometimes 
increased  the  computation  time.   The  algorithm  was  such  that  all  PE's  were 
forced  to  participate  in  computation. 

Chapter  3  presented  an  algorithm  which  reduced  tree  height  for  an 
arithmetic  expression  by  distribution.  The  algorithm  worked  from  the  inner 
most  parenthesis  pair  to  the  outer  most  one  and  scanned  an  arithmetic  expression 
only  once.  A  measure  for  the  height  of  the  minimum  height  tree  for  an  arith- 
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metic  expression  was  given  as  a  function  of  the  depth  of  parenthesis  pair 
nesting  and  the  number  of  single  variable  occurrences  in  it. 

Chapter  h   extended  the  above  idea  to  cover  a  sequence  of  arithmetic 
expressions.  It  was  shown  that  by  replacing  a  sequence  of  arithmetic  expres- 
sions with  an  arithmetic  expression  by  back  substitution,  the  computation 
time  could  be  speeded  up  in  a  logarithmic  way  for  a  certain  class  of  iteration 
formulas,  e.g.,  x.  .  :=  a  x  x.  +  b  .  The  chapter  also  showed  that  parallel 
computation  was  in  general  more  favorable  than  sequential  computation  in  terms 
of  the  round  off  error.  Furthermore  it  was  shown  that  distribution  would  not 
introduce  the  significant  amount  of  the  round  off  error. 

Chapter  5  studied  inter- statement  parallelism  as  an  introduction 
to  the  following  chapter.  An  algorithm  which  checked  if  the  execution  of 
statements  in  a  program  by  some  sequence  gave  the  same  results  as  the  execution 
of  statements  by  the  given  sequence  did  was  given.   The  algorithm  was  new  in 
the  sense  that  it  prevented  variables  from  being  updated  before  they  were  used. 
This  had  not  been  taken  into  account  by  the  previous  works.  Also  a  technique 
which  exploited  more  parallelism  between  statements  by  introducing  temporary 
locations  was  introduced. 

Chapter  6  presented  an  algorithm  which  checked  if  a  statement  in  a 
loop  could  be  executed  simultaneously  for  all  values  of  a  loop  index.   The 
algorithm  checked  index  expressions  and  the  way  the  values  of  indices  varied 
only  and  did  not  require  a  loop  to  be  replaced  with  a  sequence  of  statements. 
In  case  a  statement  in  a  loop  could  not  be  executed  in  parallel  with  respect 
to  a  loop  index  as  it  was,  the  algorithm  "skewed"  the  computation  of  a  state- 
ment with  respect  to  a  loop  index  so  that  the  statement  could  be  executed  in 
parallel  for  all  values  of  the  loop  index.  Also  to  expose  hidden  parallelism 
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from  a  loop,  replacement  of  a  loop  with  several  loops  was  discussed. 

A  solution  for  the  equally  weighted- two  processor  scheduling  problem 
was  given  in  Chapter  7«  The  only  practical  work  so  far  obtained  was  a  result 
for  scheduling  a  rooted  tree  with  equally  weighted  tasks  on  k  identical 
processors.  The  solution  given  in  Chapter  7  scheduled  a  graph  with  equally 
weighted  tasks  on  two  identical  processors.  If  we  considered  common  expres.- 
sions  in  an  arithmetic  expression  then  we  would  obtain  a  graph  of  operations 
rather  than  a  tree  for  the  arithmetic  expression  and  the  scheduling  algorithm 
was  readily  applicable  for  scheduling  that  graph  on  P(2) . 

Suggestions  for  further  research  have  been  given  in  several  places 
throughout  the  thesis  and  need  not  be  repeated  here.  We  conclude  by  giving 
two  possible  extensions  that  deserve  brief  mention. 

(1)  The  design  of  a  better  machine. 

Even  though  we  assumed  that  a  PE  can  communicate  with  any 
other  PE  instantaneously,  this  may  not  be  the  case  in  reality 
because  it  is  costly  and  impractical  to  provide  data  paths 
between  every  PE  pair.  Hence  it  is  necessary  to  design  PE 
interconnection  which  is  economical  yet  powerful  enough  to 
simulate  the  above  idealized  interconnection  [25],  [26]. 

(2)  Generalization  of  the  idea  given  in  this  thesis. 

The  three  laws  of  arithmetic  were  utilized  in  this  thesis 
in  terms  of  parallelism  exploitation.  We  should,  however,  pay 
more  attention  on  these  laws  even  in  terms  of  serial  computation. 
For  example  suppose  an  arithmetic  expression  which  involves 
matrices,  row  and  column  vectors  is  given.   Then  by  the  appro- 
priate application  of  the  associative  law,  the  number  of  multi- 
plications required  may  be  reduced  drastically. 
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